Dear All,
I’m struggling with left censored multivariate normal distribution by trying to generalize the example
http://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/censored_data.py for multivariate case.
Let’s consider a toy example. I have a sample of a 3-variate standard normal distribution with dimensions Nx3.
Assume that we define some threshold for one of the variates and mask all values which fall below.
Hence we get a left censored distribution for one of the variates. My aim is to impute this censored values.
We can think about these values as missing values but bounded to (-inf, threshold]
We don’t know the original variance-covariance matrix but know that mean vector is [0., 0., 0.] due to standard normality.
Here is my attempt:
import numpy as np
import pymc3 as pm
np.random.seed(42)
n = 10000
threshold = -0.1
# Here I define original Mean and Var-Covar Matrix
cov_matrix_original = np.array([[1, 0.8, 0.9], [0.8, 1., 0.75], [0.9, 0.75, 1.]])
Y_mu = np.array([0., 0., 0.])
# Generating sample n x 3
Y = np.random.multivariate_normal(mean=Y_mu, cov=cov_matrix_original, size=n)
# Do "censoring" for 3rd variate
mask_col = Y[:, -1] <= threshold
mask = np.zeros(Y.shape, dtype=int)
mask[mask_col, -1] = 1
Y_masked = np.ma.masked_array(Y, mask=mask)
# Censored model
with pm.Model():
packed_L = pm.LKJCholeskyCov('packed_L', n=3, eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
L = pm.expand_packed_triangular(3, packed_L)
Sigma = pm.Deterministic('Sigma', L.dot(L.T))
left_censored = pm.Bound(pm.MvNormal, upper=threshold)('left_censored', mu=Y_mu, chol=L, shape=3)
Data = pm.MvNormal('Data', mu=Y_mu, chol=L, observed=Y_masked)
trace = pm.sample(njobs=1)
Is it at all a right approach to do imputation? If so, where should I look for imputed values? Data_missing or left_censored?
After sampling I saw that many values in Data_missing are greater than the threshold.
Unfortunately, from the documentation for me, it’s totally not clear how PyMC3 handles missing values. What is the internal mechanism?
Does it exploit the known property of multivariate normal conditional distribution? (http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions)
And what should be an approach to deal with imputation when all three variates are censored with some particular threshold?
I’d really appreciate any help and advice.
Thanks
Alex