Dealing with partially observed multivariate Normal values


Consider the model

Y: Observed variables following a multivariate Normal distribution with shape (N, K) i.e. N observations with dimension K.
mu: A vector representing the mean of the multivariate Normal, with shape (K,).
cov: A matrix representing the covariance of the multivariate Normal, with shape (K, K).

The issue I have is that for some observations, the Y[i, :] vector is not fully observed e.g. we do not observed the final value Y[i, -1]. We still know that

Y[i, :-1] ~ MVN(mu[:-1], cov[:-1, :-1]

so the model should technically still work mathematically. How do I deal with the fact there can be unknown values in the observed varaible when coding a pymc3 model?

Here is my code in the case where it is assumed there are no missing observed values.

K = Y.shape[1]

with pm.Model() as model:
    # Hyper-parameters.  
    sd_Sigma = pm.HalfCauchy.dist(beta=2.5, shape=K)
    # Priors.
    mu = pm.Normal('mu_alpha', 0, 1e4, shape=K)
    L_packed = pm.LKJCholeskyCov('packed_L', n=K, eta=2., sd_dist=sd_Sigma)
    L = pm.expand_packed_triangular(K, L_packed)

    # Define likelihood.
    obs = pm.MvNormal('obs', mu, chol=L, observed=Y)

with model:
    trace = pm.sample(draws=3000, cores=1, n_init=1000)


You can have a look at the solution in How to marginalize on a Multivariate Normal distribution. Also see a related discussion here: Multivariate Normal with missing inputs


This doesn’t quite fit with my problem unfortunately. The unique issue is that for each observation vector, some of the values may be missing.

Hence for these observations, the likelihood only depends on a submatrix of the covariance and a sub vector of the mean.

I imagine I could have a for loop that defined a unique likelihood for each observation but was worried this may cause a big performance hit.


If they are all unique, then yes you need to use something like a for-loop to build the logp for each vector - the solution above is for when you have multiple slices with the same imputations but it also use a for loop.
For loop in the logp usually is not too bad performance wise, it would be a bit slow during model set up, but it will only be called a couple times to build the tensor representation.