Multivariate Normal with missing inputs

What would be the right way to model a Multivariate Normal distribution when some values are missing from every observation, i.e., when the input is very sparse?

For example, my observations may look something like this:

[[1.11,   --,  0.9,   --, 1.23,   --],
 [  --,  1.2, 0.81,   --,   --,   --],
 [  --,   --,   --,   --, 1.21,   --],
 ...]

I was initially letting PyMC handle it, but I now realise that isn’t the right way because the samples are drawn i.i.d.
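For context, by “letting PyMC handle it” I mean something like the sketch below (the priors here are placeholders of my own choosing): passing a NumPy masked array as observed, which triggers PyMC3’s automatic imputation of the masked entries.

import numpy as np
import pymc3 as pm

# toy data in the shape of the example above; NaN marks a missing entry
X = np.array([[1.11, np.nan, 0.90],
              [np.nan, 1.20, 0.81],
              [np.nan, np.nan, 1.21]])

with pm.Model() as model:
    mu = pm.Normal('mu', 0., 10., shape=3)
    packed_L = pm.LKJCholeskyCov('packed_L', n=3, eta=2.,
                                 sd_dist=pm.HalfNormal.dist(1.))
    chol = pm.expand_packed_triangular(3, packed_L)
    # the masked entries become an 'obs_missing' free variable
    # that PyMC3 imputes during sampling
    obs = pm.MvNormal('obs', mu=mu, chol=chol,
                      observed=np.ma.masked_invalid(X))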

It has been mentioned in a few places on the forums that pm.Potential can be used in such cases, but what exactly is a potential in this context, and how might I use it here?

See: How to marginalize on a Multivariate Normal distribution :smile:

Thanks for your reply! Your response on that post was indeed super helpful; it really got me thinking in the right direction. I see how your notebook worked around the problem I just mentioned, and I should be able to write it with pm.Potential.

If you could perhaps point me to some literature on what this potential actually means (the closest definition I have found so far is that of a factor potential in Markov random fields), I’d be grateful.

Oh, it’s much simpler than that. In PyMC3, potentials are terms you want to add to the model logp. For example, if you write

with pm.Model() as m1:
    ...
    obs = pm.MvNormal('obs', mu, cov, observed=Xdata)

What PyMC3 does within the model m1 is sum the logp of obs together with the logps of the other random variables to get the model logp, which means it is the same as:

with pm.Model() as m2:
    ... # same code as in m1
    obs = pm.Potential('obs', pm.MvNormal.dist(mu, cov).logp(Xdata))

where obs (now a theano tensor representing the observed logp) is added to the model logp in the same way.
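As a concrete sanity check (with made-up mu, cov and data of my own), the two formulations evaluate to the same model logp:

import numpy as np
import pymc3 as pm

Xdata = np.random.randn(10, 3)
mu, cov = np.zeros(3), np.eye(3)

with pm.Model() as m1:
    pm.MvNormal('obs', mu=mu, cov=cov, observed=Xdata)

with pm.Model() as m2:
    pm.Potential('obs', pm.MvNormal.dist(mu=mu, cov=cov).logp(Xdata))

# both models give the same total log-probability
np.testing.assert_allclose(m1.logp(m1.test_point),
                           m2.logp(m2.test_point))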


Ah, alright, that makes sense. The name was a little misleading, so I dug into some old PyMC2 documentation and found that it was indeed named after factor potentials (though it is not quite the same thing).

@junpenglao, a related question about the following line from your notebook:

for i, im in enumerate(uni_case):  # im: one unique missingness pattern (True = missing)
    pm.MvNormal('obs%i'%(i),
                mu[im == False],                      # marginal mean over the observed dims
                cov[im == False, :][:, im == False],  # marginal covariance over the observed dims
                observed=X_slice[np.sum(maskall == im, axis=1) > 0, :][:, im == False])

To observed we pass a sub-array (by which I mean an array of shape smaller than the full dataset), and mu and cov are themselves sub-tensors. How then does the model know which marginal we’re working with?

To be more specific:

Say one of the observations is [--, .5, 1., --, .9], so the marginals I want to work with are {X2, X3, X5}. observed gets [[.5, 1., .9]], and the mu and cov passed to MvNormal are sub-tensors. How does the model decide that these observations correspond to X2, X3 and X5 as opposed to, say, X1, X2 and X3, given that we are not explicitly passing any mask?

The free parameters mu and cov are indexed as well (e.g., mu[im == False]), so the MvNormal is now 3×3 instead of 5×5.
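The positions line up because both sub-tensors are taken from the one shared mu and cov: indexing with im == False picks out positions 1, 2 and 4, so exactly those entries enter this likelihood term. This relies on the marginalization property of the multivariate normal, i.e. a sub-vector of x ~ N(mu, Sigma) is again normal with the correspondingly indexed mean and covariance. A quick numerical sketch of that property (the variable names are mine):

import numpy as np

rng = np.random.RandomState(0)
mu = np.arange(5.)
A = rng.randn(5, 5)
cov = A @ A.T + 5 * np.eye(5)  # a valid covariance matrix

# missingness pattern for [--, .5, 1., --, .9]: True = missing
im = np.array([True, False, False, True, False])

mu_sub = mu[~im]             # marginal mean, shape (3,)
cov_sub = cov[~im][:, ~im]   # marginal covariance, shape (3, 3)

# empirical check: marginalize the full MvNormal by sampling
samples = rng.multivariate_normal(mu, cov, size=200000)
print(np.allclose(samples[:, ~im].mean(0), mu_sub, atol=0.05))
print(np.allclose(np.cov(samples[:, ~im].T), cov_sub, atol=0.1))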


@junpenglao, it seems that with the current model (using the unique masks that occur in the data) it is impossible to generalize the distribution to conditionals that do not occur in the dataset. Intuitively that seems wrong, because the right way of looking at this is as a generative process, and a truly Bayesian method would allow sampling from conditionals not seen in the dataset.

Is there a way to generalize the existing model to conditional distributions that were not seen in the data? Maybe by modelling X_missing as part of the generative process rather than having it drawn i.i.d.?

Not sure I got what you mean. Once you sample from your model, you have the full posterior of the mu and cov of the MvNormal, which is your target generative process. You can then compute the conditional MvNormal on each slice of the MCMC samples and generate predictions for unseen data.
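For completeness, that conditioning step is the standard Gaussian conditioning formula: for observed indices o and missing indices m,

mu_{m|o} = mu_m + Sigma_{mo} Sigma_{oo}^{-1} (x_o - mu_o)
Sigma_{m|o} = Sigma_{mm} - Sigma_{mo} Sigma_{oo}^{-1} Sigma_{om}

A sketch you could apply to each posterior draw (the function name is mine):

import numpy as np

def conditional_mvn(mu, cov, x_obs, obs_idx):
    # distribution of the missing dims given observed values
    # x_obs at positions obs_idx
    d = len(mu)
    o = np.zeros(d, dtype=bool)
    o[obs_idx] = True
    m = ~o
    # K = Sigma_mo @ inv(Sigma_oo), computed via a solve for stability
    K = np.linalg.solve(cov[np.ix_(o, o)], cov[np.ix_(o, m)]).T
    mu_cond = mu[m] + K @ (np.asarray(x_obs) - mu[o])
    cov_cond = cov[np.ix_(m, m)] - K @ cov[np.ix_(o, m)]
    return mu_cond, cov_cond

# e.g. for each posterior draw mu_i, cov_i from the trace:
# mu_c, cov_c = conditional_mvn(mu_i, cov_i,
#                               x_obs=[.5, 1., .9], obs_idx=[1, 2, 4])
# imputation = np.random.multivariate_normal(mu_c, cov_c)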

You’re right, I was thinking of something else (which was, in fact, incorrect).