Multivariate Normal with missing inputs

What would be the right way to model a Multivariate Normal distribution when some values are missing in every observation, i.e., the input is very sparse?

For example, my observations may look something like

[[1.11, --, 0.9, --, 1.23, --], 
 [--, 1.2, 0.81, --,  --, --],
 [--, --, --, --, 1.21, --], 
... ] 
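For concreteness, one way to hold such data in NumPy is a boolean missingness mask (or a masked array). This is just an illustrative sketch using the example rows above, with `np.nan` standing in for the `--` cells:

```python
import numpy as np

# The example dataset above, with np.nan marking the missing ("--") entries.
X = np.array([
    [1.11, np.nan, 0.90, np.nan, 1.23, np.nan],
    [np.nan, 1.20, 0.81, np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan, 1.21, np.nan],
])

# Boolean mask: True where a value is missing.
mask = np.isnan(X)

# A masked array keeps the values and the mask together.
Xm = np.ma.masked_invalid(X)

# Rows can later be grouped by their unique missingness patterns.
patterns = np.unique(mask, axis=0)
```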

I was initially letting PyMC handle it, but I realise now that that isn’t the right way because the samples are drawn iid.

It has been mentioned in a few places on the forums that pm.Potential can be used in such cases, but what exactly is a potential in this context, and how might I use it here?

See: How to marginalize on a Multivariate Normal distribution :smile:

Thanks for your reply! Indeed your response on the post was super helpful, it really helped me think in the right direction. I did get how your notebook worked around the problem I just mentioned and I should be able to write it with pm.Potential.

If you could perhaps point me to some literature on what this potential actually means (the closest definition I have found so far is of a factor potential in Markov Random Fields) I’d be grateful.

Oh, it’s much simpler than that. In PyMC3, potentials are terms you want to add to the model logp. For example, if you write

with pm.Model() as m1:
    obs = pm.MvNormal('obs', mu, cov, observed=Xdata)

Within the model m1, PyMC3 sums the logp of obs with the logps of the other random variables to get the model logp, which means it is the same as:

with pm.Model() as m2:
    ... # same code as in m1
    obs = pm.Potential('obs', pm.MvNormal.dist(mu, cov).logp(Xdata))

where obs (now a theano tensor representing the observed logp) is added to the model logp similarly.
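In plain NumPy/SciPy terms (a standalone sketch, not PyMC code), "adding to the model logp" is just scalar addition of log-density terms; the summed logp of an observed RV and an equivalent explicit potential term are the same number:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu = np.zeros(3)
cov = np.eye(3)
Xdata = rng.normal(size=(5, 3))

# "m1-style": logp of the observed RV, summed over the iid rows.
logp_obs = stats.multivariate_normal(mu, cov).logpdf(Xdata).sum()

# "m2-style": the same scalar supplied explicitly, as pm.Potential would.
potential_term = sum(stats.multivariate_normal(mu, cov).logpdf(x) for x in Xdata)

print(np.isclose(logp_obs, potential_term))  # True
```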


Ah alright, that makes sense. The name was a little misleading, so I dug into some old PyMC2 documentation to find that it was indeed named after factor potentials (though not quite the same one).

@junpenglao, a related question about the following line from your notebook:

for i, im in enumerate(uni_case):
    pm.MvNormal('obs_%i' % i,
                mu[im == False],
                cov[im == False, :][:, im == False],
                observed=X_slice[np.sum(maskall == im, axis=1) > 0, :][:, im == False])

To observed we pass a sub-array (by which I mean an array of shape smaller than the full dataset), and mu and sigma are themselves sub-tensors. How then does our model know which marginal we’re working with?

To be more specific:

Say one of the observations is [--, .5, 1., --, .9] so the marginals I want to work with will be {X2, X3, X5}. observed gets [[.5, 1., .9]] and mu and sigma passed to MvNormal are sub-tensors. How does the model decide that these observations correspond to X2, X3 and X5 as opposed to, say, X1, X2 and X3 since we are not explicitly passing any mask to it?

The free parameters mu and cov are indexed as well (e.g., mu[im == False]), so the MvNormal is now 3×3 instead of 5×5.
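This marginalization-by-indexing can be checked numerically. A standalone NumPy/SciPy sketch (the mask and values are illustrative, not from the notebook): the notebook's double slice is the observed-dims submatrix, and the marginal over those dims is itself an MvNormal with the sub-mean and sub-covariance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# An illustrative full 5-dimensional MvNormal.
mu = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
A = rng.normal(size=(5, 5))
cov = A @ A.T + 5 * np.eye(5)          # positive definite by construction

im = np.array([True, False, False, True, False])   # True = missing
keep = ~im

# The notebook's double slice picks out the observed-dims submatrix...
cov_sub = cov[keep, :][:, keep]
# ...identical to np.ix_ fancy indexing:
assert np.allclose(cov_sub, cov[np.ix_(keep, keep)])

# The marginal over the observed dims is MvNormal(mu[keep], cov_sub):
# compare against a Monte Carlo estimate from the full distribution.
samples = stats.multivariate_normal(mu, cov).rvs(size=200_000, random_state=1)
emp_mean = samples[:, keep].mean(axis=0)
emp_cov = np.cov(samples[:, keep], rowvar=False)
```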


@junpenglao, it seems that with the current model (using unique masks for the data) it is impossible to generalize the distribution to conditionals that do not occur in the dataset. Intuitively, this seems wrong: the right way to look at this is as a generative process, and a truly Bayesian method would allow sampling from conditionals not seen in the dataset.

Is there a possibility of generalizing the existing model to unseen conditional distributions? Maybe a way to model X_missing as part of the generative process rather than having it drawn i.i.d.?

Not sure I got what you mean: once you sample from your model, you have the full posterior of mu and cov of the MvNormal, which is your target generative process. You can then compute the conditional MvNormal on each slice of the MCMC samples and generate data for unseen conditionals.
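Concretely, the conditioning step on each posterior draw is just the standard Gaussian conditioning formula. A minimal sketch (the function name and signature are my own, not from the notebook):

```python
import numpy as np

def conditional_mvn(mu, cov, obs_idx, obs_val):
    """Mean and covariance of the unobserved dims given the observed ones.

    Standard Gaussian conditioning:
      mu_m|o = mu_m + C_mo C_oo^{-1} (x_o - mu_o)
      C_m|o  = C_mm - C_mo C_oo^{-1} C_om
    """
    o = np.asarray(obs_idx)
    m = np.setdiff1d(np.arange(len(mu)), o)
    C_oo = cov[np.ix_(o, o)]
    C_mo = cov[np.ix_(m, o)]
    cond_mu = mu[m] + C_mo @ np.linalg.solve(C_oo, obs_val - mu[o])
    cond_cov = cov[np.ix_(m, m)] - C_mo @ np.linalg.solve(C_oo, C_mo.T)
    return cond_mu, cond_cov
```

Applied to each MCMC draw of mu and cov, this yields posterior samples of any conditional, including missingness patterns never seen in the data.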

You’re right, I was thinking of something else (which was, in fact, incorrect).