Dealing with imputation of left censored multivariate normal distribution

Kyrish · June 19, 2018, 4:32pm

Dear All,

I’m struggling with left censored multivariate normal distribution by trying to generalize the example
http://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/censored_data.py for multivariate case.

Let’s consider a toy example. I have a sample of a 3-variate standard normal distribution with dimensions Nx3.
Assume that we define some threshold for one of the variates and mask all values which fall below.
Hence we get a left censored distribution for one of the variates. My aim is to impute this censored values.
We can think about these values as missing values but bounded to (-inf, threshold]
We don’t know the original variance-covariance matrix but know that mean vector is [0., 0., 0.] due to standard normality.

Here is my attempt:

import numpy as np
import pymc3 as pm

np.random.seed(42)
n = 10000
threshold = -0.1

# Here I define original Mean and Var-Covar Matrix
cov_matrix_original = np.array([[1, 0.8, 0.9], [0.8, 1., 0.75], [0.9, 0.75, 1.]])
Y_mu = np.array([0., 0., 0.])

# Generating sample n x 3
Y = np.random.multivariate_normal(mean=Y_mu, cov=cov_matrix_original, size=n)

# Do "censoring" for 3rd variate
mask_col = Y[:, -1] <= threshold
mask = np.zeros(Y.shape, dtype=int)
mask[mask_col, -1] = 1

Y_masked = np.ma.masked_array(Y, mask=mask)

# Censored model
with pm.Model():
    packed_L = pm.LKJCholeskyCov('packed_L', n=3, eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L = pm.expand_packed_triangular(3, packed_L)
    Sigma = pm.Deterministic('Sigma', L.dot(L.T))    

    left_censored = pm.Bound(pm.MvNormal, upper=threshold)('left_censored', mu=Y_mu, chol=L, shape=3)
    Data = pm.MvNormal('Data', mu=Y_mu, chol=L, observed=Y_masked)

    trace = pm.sample(njobs=1)

Is it at all a right approach to do imputation? If so, where should I look for imputed values? Data_missing or left_censored?
After sampling I saw that many values in Data_missing are greater than the threshold.

Unfortunately, from the documentation for me, it’s totally not clear how PyMC3 handles missing values. What is the internal mechanism?
Does it exploit the known property of multivariate normal conditional distribution? (http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions)
And what should be an approach to deal with imputation when all three variates are censored with some particular threshold?

I’d really appreciate any help and advice.
Thanks
Alex

junpenglao · June 20, 2018, 5:33am

Multivariate imputation is usually quite complicated, as the missing value makes the logp evaluation difficult - For example if you have observed [a, b, x] ~ MvNormal(mu, cov) with x being the missing value, you need to derive the marginal distribution of [a, b] and the conditional of x |(a,b). So in this case, I dont think using Bound variable is the correct way to go, and the missing value in PyMC3 is not build to handle such cases.

A side note on how PyMC3 handles missing values: in short it create a new random variable where it’s distribution follows the correct conditional distribution.

github.com

pymc-devs/pymc3/blob/70774559e686334c31e025b496d9c9a713a593f5/pymc3/model.py#L1254-L1266


if hasattr(data, 'mask'):
    from .distributions import NoDistribution
    testval = np.broadcast_to(distribution.default(), data.shape)[data.mask]
    fakedist = NoDistribution.dist(shape=data.mask.sum(), dtype=dtype,
                                   testval=testval, parent_dist=distribution)
    missing_values = FreeRV(name=name + '_missing', distribution=fakedist,
                            model=model)
    constant = tt.as_tensor_variable(data.filled())


    dataTensor = tt.set_subtensor(
        constant[data.mask.nonzero()], missing_values)
    dataTensor.missing_values = missing_values
    return dataTensor

However, it only works for iid cases where the missing values do not depend on other observation (again it is not the case here)

In your use case, what I would do is treat the censored value as none-censored, and then account for the censoring using pm.Potential.

Kyrish · June 20, 2018, 8:39am

Thanks @junpenglao for your valuable response!
But I’m a bit confused with:

In your use case, what I would do is treat the censored value as none-censored, and then account for the censoring using pm.Potential.

What does it mean? In my real case, censored values are censored and I know only they lie below the threshold. Also, I know that expectations are zeros but I don’t know the variance-covariance matrix…Could you please clarify a bit more?

Thanks in advance,
Alex

junpenglao · June 20, 2018, 8:46am

If I understand correctly, in your case one of the dimension the observed MvNormal cannot take value below a bound, meaning that if the actually value is below the threshold, it will just return the threshold value?

Kyrish · June 20, 2018, 9:15am

Yes, correct. And my goal to impute (or generate, or sample) each of these values which follow a conditional distribution

p(x_i | a_i, b_i)

where i is an index of sample’s row. But in order to apply conditional distribution for MVN I have to know variance-covariance matrix, but I don’t.

junpenglao · June 20, 2018, 12:34pm

The trick is to treat the cov matrix as known (i.e., Sigma in your model) and compute the conditional, I am trying to follow conditional distribution from wikipedia but the code is not quite work yet:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

np.random.seed(42)
n = 10000
threshold = -0.15

# Here I define original Mean and Var-Covar Matrix
cov_matrix_original = np.array([[1, 0.8, 0.9], [0.8, 1., 0.75], [0.9, 0.75, 1.]])
Y_mu = np.array([0., 0., 0.])

# Generating sample n x 3
Y = np.random.multivariate_normal(mean=Y_mu, cov=cov_matrix_original, size=n)

# Do "censoring" for 3rd variate
mask_col = Y[:, -1] <= threshold
Y[mask_col, -1] = threshold

# compute the conditional CDF to account for the censoring
def mv_cond(idx, nd, mu, cov, cond_a):
    idx1 = np.arange(nd) != idx
    idx2 = np.arange(nd) == idx
    mu_1 = mu[idx1]
    mu_2 = mu[idx2]
    S11 = cov[idx1, idx1]
    S12 = cov[idx1, idx2]
    S21 = cov[idx2, idx1]
    S22 = cov[idx1, :][:, idx1]
    mu_ = mu_1 + tt.dot(S12.dot(tt.nlinalg.matrix_inverse(S22)), cond_a-mu_2)
    sd_ = S11 - tt.dot(S12.dot(tt.nlinalg.matrix_inverse(S22)), S21)
    return mu_, sd_

def normal_lcdf(mu, sigma, x):
    z = (x - mu) / sigma
    return tt.switch(
        tt.lt(z, -1.0),
        tt.log(tt.erfcx(-z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(z / tt.sqrt(2.)) / 2.)
    )

# Censored model
with pm.Model():
    packed_L = pm.LKJCholeskyCov(
        'packed_L', n=3, eta=2., sd_dist=pm.HalfNormal.dist(2.5))
    L = pm.expand_packed_triangular(3, packed_L)
    Sigma = pm.Deterministic('Sigma', L.dot(L.T))

    Data = pm.MvNormal('Data', mu=Y_mu, chol=L, observed=Y)
    mu_c, sd_c = mv_cond(2, 3, Y_mu, Sigma, Y_mu[:-1])
    pm.Potential('censor_cdf', normal_lcdf(mu_c, sd_c, threshold) * mask_col.sum())
    trace = pm.sample()

junpenglao · June 20, 2018, 12:35pm

The estimation of the covariance matrix is still off. I dont have time to look into it further but I imagine there should be a better implementation to find the conditional CDF (currently using matrix inverse which can be quite slow)

Kyrish · June 20, 2018, 1:00pm

Thank you very much for the instructive reply. I’ll keep to dig into. Regards.

Topic		Replies	Views
Handling missing values in predictor when outcome is a Multivariate Normal distribution v5	7	99	October 25, 2024
Multivariate normal with missing data imputation operands could not be broadcast together with shapes (29,2) () (29,) Questions theano	12	1859	September 7, 2020
Multivariate model with censored data Questions	3	415	May 14, 2020
Generating censored data from posterior distribution	4	232	June 10, 2024
Automatic Imputation of Multivariate models Questions	4	723	December 19, 2022

Dealing with imputation of left censored multivariate normal distribution

Related topics