Generating censored data from posterior distribution

Hello all,
I am fairly new to PyMC and probabilistic programming in general. I've been trying to use MCMC to handle a dataset with left-censored data and found this very nice example.

I've run some tests with the model and I get good results, but I'm having trouble extracting samples from the left-censored posterior distribution. The example says the imputed censored model is capable of generating data sets of censored values (sample from the posteriors of left_censored and right_censored to generate them). However, when sampling from the posteriors I get data from the entire distribution, not only the censored values. A minimal repro is below. What am I doing wrong?

I also have a more general question related to this topic: for a simple distribution like in this case, is there a reason to use the sample_posterior_predictive function instead of simply drawing random samples from a newly defined distribution with the parameters (sigma and mu) obtained from the MCMC run?
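To make that second question concrete, here is a sketch of what I mean by "drawing manually". The mu_s and sigma_s arrays below are stand-ins for the flattened posterior draws, not real results:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the flattened posterior draws, e.g.
# trace.posterior["mu"].values.flatten() and
# trace.posterior["sigma"].values.flatten()
mu_s = rng.normal(1.5, 0.05, size=2000)
sigma_s = np.abs(rng.normal(0.8, 0.05, size=2000))

# One normal draw per posterior (mu, sigma) pair: a "manual"
# posterior predictive sample
manual_ppc = rng.normal(loc=mu_s, scale=sigma_s)
```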

Thanks a lot already,
best regards

Minimal Repro

import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
import seaborn as sns

Define input data

N_data = 150
mu_th = 1.5
sigma_th = 0.8
Q = 0.5

Generate random data

Y_data = np.random.normal(loc=mu_th, scale=sigma_th, size=N_data)

Censor data

RL = np.quantile(Y_data, Q)
Y_obs = Y_data[Y_data>RL]
N_cen = len(Y_data[Y_data<=RL])
N_obs = len(Y_obs)

with pm.Model() as imputed_censored_model:
    mu = pm.Normal("mu", mu=np.mean(Y_obs), sigma=1)
    sigma = pm.LogNormal("sigma", mu=np.std(Y_obs), sigma=1)

    left_censored = pm.Normal(
        "left_censored", mu, sigma,
        transform=pm.distributions.transforms.Interval(None, RL),
        shape=N_cen, initval=np.full(N_cen, RL - 1),
    )

    observed = pm.Normal("obs", mu=mu, sigma=sigma, observed=Y_obs, shape=N_obs)

    # Sampling
    trace = pm.sample(1000, return_inferencedata=True)

    # Posterior predictive (must run inside the model context)
    posterior_predictive_censored = pm.sample_posterior_predictive(
        trace, var_names=["left_censored"]
    )
    posterior_predictive_obs = pm.sample_posterior_predictive(trace, var_names=["obs"])

Extract values

ppc_val = posterior_predictive_censored.posterior_predictive["left_censored"].values.flatten()
ppo_val = posterior_predictive_obs.posterior_predictive["obs"].values.flatten()


fig, axs = plt.subplots()
sns.kdeplot(ppc_val, ax=axs, label="left censored")
sns.kdeplot(ppo_val, ax=axs, ls="--", label="obs")
axs.legend()


I think you actually have truncated data, not censored. In any case, you could try pm.Censored or pm.Truncated.


Thank you Ricardo,
yes, you are right: in the example the data are truncated, but in my real application I am working with censored data. In this model I remove the censored data but keep the information about how many there are, which would not be available with a truncated dataset.
I've used pm.Censored and it works fine, but I wanted an imputed censored model so I could generate plausible sets of censored data. From my understanding that's not possible with pm.Censored, as also explained in the example I linked above.

So, if you have ideas regarding my question above that would be great.

You can impute during sample_posterior_predictive as demonstrated in this blog post: Out of model predictions with PyMC - PyMC Labs

See the section "Predicting uncensored variables". You basically add a new variable equivalent to the "dist" of the Censored. It should be equivalent to imputing with MCMC during pm.sample, but faster and easier to write down.


Thank you very much Ricardo, and sorry for my slow answer.
The blog post is very clear and I was able to implement the solution you suggested.
Unfortunately, I'm not getting the results I expected, but I've asked about that in a separate thread.
