Dealing with partially observed multivariate Normal values

Woger_W · January 4, 2019, 4:55pm

Consider the model

Y: Observed variables following a multivariate Normal distribution with shape (N, K) i.e. N observations with dimension K.
mu: A vector representing the mean of the multivariate Normal, with shape (K,).
cov: A matrix representing the covariance of the multivariate Normal, with shape (K, K).

The issue I have is that for some observations, the Y[i, :] vector is not fully observed e.g. we do not observed the final value Y[i, -1]. We still know that

Y[i, :-1] ~ MVN(mu[:-1], cov[:-1, :-1]

so the model should technically still work mathematically. How do I deal with the fact there can be unknown values in the observed varaible when coding a pymc3 model?

Here is my code in the case where it is assumed there are no missing observed values.

K = Y.shape[1]

with pm.Model() as model:
    # Hyper-parameters.  
    sd_Sigma = pm.HalfCauchy.dist(beta=2.5, shape=K)
    
    # Priors.
    mu = pm.Normal('mu_alpha', 0, 1e4, shape=K)
    
    L_packed = pm.LKJCholeskyCov('packed_L', n=K, eta=2., sd_dist=sd_Sigma)
    L = pm.expand_packed_triangular(K, L_packed)

    # Define likelihood.
    obs = pm.MvNormal('obs', mu, chol=L, observed=Y)

with model:
    trace = pm.sample(draws=3000, cores=1, n_init=1000)

junpenglao · January 5, 2019, 3:13am

You can have a look at the solution in How to marginalize on a Multivariate Normal distribution. Also see a related discussion here: Multivariate Normal with missing inputs

Woger_W · January 7, 2019, 12:32pm

This doesn’t quite fit with my problem unfortunately. The unique issue is that for each observation vector, some of the values may be missing.

Hence for these observations, the likelihood only depends on a submatrix of the covariance and a sub vector of the mean.

I imagine I could have a for loop that defined a unique likelihood for each observation but was worried this may cause a big performance hit.

junpenglao · January 7, 2019, 3:33pm

If they are all unique, then yes you need to use something like a for-loop to build the logp for each vector - the solution above is for when you have multiple slices with the same imputations but it also use a for loop.
For loop in the logp usually is not too bad performance wise, it would be a bit slow during model set up, but it will only be called a couple times to build the tensor representation.

Topic		Replies	Views
What does the hierarchical model look like when having missing in observed? Questions	12	1156	October 31, 2018
Multivariate Normal with missing inputs Questions	9	1525	December 27, 2018
How to marginalize on a Multivariate Normal distribution Questions	13	8131	November 23, 2018
Sampling semantics of multiple observed variables Questions	5	2579	April 25, 2019
Multivariate normal with missing data imputation operands could not be broadcast together with shapes (29,2) () (29,) Questions theano	12	1882	September 7, 2020

Dealing with partially observed multivariate Normal values

Related topics