Dealing with imputation of left censored multivariate normal distribution

Dear All,

I’m struggling with left censored multivariate normal distribution by trying to generalize the example
http://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/censored_data.py for multivariate case.

Let’s consider a toy example. I have a sample of a 3-variate standard normal distribution with dimensions Nx3.
Assume that we define some threshold for one of the variates and mask all values which fall below.
Hence we get a left censored distribution for one of the variates. My aim is to impute this censored values.
We can think about these values as missing values but bounded to (-inf, threshold]
We don’t know the original variance-covariance matrix but know that mean vector is [0., 0., 0.] due to standard normality.

Here is my attempt:

import numpy as np
import pymc3 as pm

np.random.seed(42)
n = 10000
threshold = -0.1

# Here I define original Mean and Var-Covar Matrix
cov_matrix_original = np.array([[1, 0.8, 0.9], [0.8, 1., 0.75], [0.9, 0.75, 1.]])
Y_mu = np.array([0., 0., 0.])

# Generating sample n x 3
Y = np.random.multivariate_normal(mean=Y_mu, cov=cov_matrix_original, size=n)

# Do "censoring" for 3rd variate
mask_col = Y[:, -1] <= threshold
mask = np.zeros(Y.shape, dtype=int)
mask[mask_col, -1] = 1

Y_masked = np.ma.masked_array(Y, mask=mask)

# Censored model
with pm.Model():
    packed_L = pm.LKJCholeskyCov('packed_L', n=3, eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L = pm.expand_packed_triangular(3, packed_L)
    Sigma = pm.Deterministic('Sigma', L.dot(L.T))    

    left_censored = pm.Bound(pm.MvNormal, upper=threshold)('left_censored', mu=Y_mu, chol=L, shape=3)
    Data = pm.MvNormal('Data', mu=Y_mu, chol=L, observed=Y_masked)

    trace = pm.sample(njobs=1)

Is it at all a right approach to do imputation? If so, where should I look for imputed values? Data_missing or left_censored?
After sampling I saw that many values in Data_missing are greater than the threshold.

Unfortunately, from the documentation for me, it’s totally not clear how PyMC3 handles missing values. What is the internal mechanism?
Does it exploit the known property of multivariate normal conditional distribution? (http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions)
And what should be an approach to deal with imputation when all three variates are censored with some particular threshold?

I’d really appreciate any help and advice.
Thanks
Alex

1 Like

Multivariate imputation is usually quite complicated, as the missing value makes the logp evaluation difficult - For example if you have observed [a, b, x] ~ MvNormal(mu, cov) with x being the missing value, you need to derive the marginal distribution of [a, b] and the conditional of x |(a,b). So in this case, I dont think using Bound variable is the correct way to go, and the missing value in PyMC3 is not build to handle such cases.

A side note on how PyMC3 handles missing values: in short it create a new random variable where it’s distribution follows the correct conditional distribution.


However, it only works for iid cases where the missing values do not depend on other observation (again it is not the case here)

In your use case, what I would do is treat the censored value as none-censored, and then account for the censoring using pm.Potential.

Thanks @junpenglao for your valuable response!
But I’m a bit confused with:

In your use case, what I would do is treat the censored value as none-censored, and then account for the censoring using pm.Potential.

What does it mean? In my real case, censored values are censored and I know only they lie below the threshold. Also, I know that expectations are zeros but I don’t know the variance-covariance matrix…Could you please clarify a bit more?

Thanks in advance,
Alex

If I understand correctly, in your case one of the dimension the observed MvNormal cannot take value below a bound, meaning that if the actually value is below the threshold, it will just return the threshold value?

Yes, correct. And my goal to impute (or generate, or sample) each of these values which follow a conditional distribution

p(x_i | a_i, b_i)

where i is an index of sample’s row. But in order to apply conditional distribution for MVN I have to know variance-covariance matrix, but I don’t.

The trick is to treat the cov matrix as known (i.e., Sigma in your model) and compute the conditional, I am trying to follow conditional distribution from wikipedia but the code is not quite work yet:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

np.random.seed(42)
n = 10000
threshold = -0.15

# Here I define original Mean and Var-Covar Matrix
cov_matrix_original = np.array([[1, 0.8, 0.9], [0.8, 1., 0.75], [0.9, 0.75, 1.]])
Y_mu = np.array([0., 0., 0.])

# Generating sample n x 3
Y = np.random.multivariate_normal(mean=Y_mu, cov=cov_matrix_original, size=n)

# Do "censoring" for 3rd variate
mask_col = Y[:, -1] <= threshold
Y[mask_col, -1] = threshold

# compute the conditional CDF to account for the censoring
def mv_cond(idx, nd, mu, cov, cond_a):
    idx1 = np.arange(nd) != idx
    idx2 = np.arange(nd) == idx
    mu_1 = mu[idx1]
    mu_2 = mu[idx2]
    S11 = cov[idx1, idx1]
    S12 = cov[idx1, idx2]
    S21 = cov[idx2, idx1]
    S22 = cov[idx1, :][:, idx1]
    mu_ = mu_1 + tt.dot(S12.dot(tt.nlinalg.matrix_inverse(S22)), cond_a-mu_2)
    sd_ = S11 - tt.dot(S12.dot(tt.nlinalg.matrix_inverse(S22)), S21)
    return mu_, sd_

def normal_lcdf(mu, sigma, x):
    z = (x - mu) / sigma
    return tt.switch(
        tt.lt(z, -1.0),
        tt.log(tt.erfcx(-z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(z / tt.sqrt(2.)) / 2.)
    )

# Censored model
with pm.Model():
    packed_L = pm.LKJCholeskyCov(
        'packed_L', n=3, eta=2., sd_dist=pm.HalfNormal.dist(2.5))
    L = pm.expand_packed_triangular(3, packed_L)
    Sigma = pm.Deterministic('Sigma', L.dot(L.T))

    Data = pm.MvNormal('Data', mu=Y_mu, chol=L, observed=Y)
    mu_c, sd_c = mv_cond(2, 3, Y_mu, Sigma, Y_mu[:-1])
    pm.Potential('censor_cdf', normal_lcdf(mu_c, sd_c, threshold) * mask_col.sum())
    trace = pm.sample()
1 Like

The estimation of the covariance matrix is still off. I dont have time to look into it further but I imagine there should be a better implementation to find the conditional CDF (currently using matrix inverse which can be quite slow)

Thank you very much for the instructive reply. I’ll keep to dig into. Regards.

1 Like