# Issue with the accuracy of reconstructed data for censored data analysis

Hello all,

I’ve been working on analyzing censored data using PyMC (version 5.10.4) and already asked a question a few days ago. I now have a working model that uses `pm.Censored` and `sample_posterior_predictive` to reconstruct the uncensored distribution. However, I am encountering some issues with the quality of reconstructed uncensored distributions.

This is the process I am following:

• I generate random samples from a lognormal distribution.
• Set an arbitrary censoring limit (e.g. at the median of the distribution) and censor all data below this limit.
• Run the MCMC model using `pm.Censored` to reconstruct the distribution from the posterior.
• Calculate key parameters (mu, sigma, median, etc.)
• Repeat the process multiple times and assess the results using MSE and bias.

The results are comparable to those obtained by the substitution method (setting all censored values to the censoring limit/2), which is generally considered suboptimal for such analyses. For your information, I started this because I wanted to have a Python version of the method developed in STAN by Suzuki et al., 2020. They also worked with log-normally distributed data and showed, for example, that with a censoring limit set to the 80th percentile, MCMC would yield an MSE two orders of magnitude better than the substitution method.

Here are my questions:

• Is there an error in how I’m setting up or running my model?
• Are there specific parameters or adjustments within PyMC that could potentially improve these outcomes?

I’ve attached a minimal reproducible example below for reference. Any insights or suggestions would be greatly appreciated.

``````import numpy as np
import pymc as pm
import pandas as pd
import scipy.stats as stats

# Define input for random data generation
N_obs = 150 # Number of observations per sample
mu_th = 1.5
sigma_th = 0.8
Q = 0.8 # Quantile to which the censoring limit is set
iter_nr = 20 # Number of samples

# Create a MultiIndex df to store the metrics from the varius samples
metrics = ['mu', 'sigma', 'mean', 'q25', 'median', 'q95']
methods = ['Original', 'Removal', 'Substitution', 'MCMC']
multi_index = pd.MultiIndex.from_product([metrics, methods], names=['Metric', 'Method'])
# Initialize the combined DataFrame
df_results = pd.DataFrame(index=range(iter_nr), columns=multi_index)

# Calculate the theoretical values of the other metrics
q25_th, med_th, q95_th = stats.lognorm.ppf( [.25,.5, .95],
scale = np.exp(mu_th), s = sigma_th)
mean_th = stats.lognorm.stats(scale = np.exp(mu_th), s = sigma_th, moments='m')
df_th = pd.DataFrame([[mu_th, sigma_th, mean_th, q25_th, med_th, q95_th]], columns = metrics)

# Define a function to calculate selected metrics from a sample and update df
def calculate_and_update_metrics(data, df, iteration, method):
"""
Calculate  metrics for each dataset and update the DataFrame.
"""
df.loc[iteration, ('mu', method)] = np.mean(np.log(data))
df.loc[iteration, ('sigma', method)] = np.std(np.log(data))
df.loc[iteration, ('mean', method)] = np.mean(data)
df.loc[iteration, ('q25', method)] = np.percentile(data, 25)
df.loc[iteration, ('median', method)] = np.median(data)
df.loc[iteration, ('q95', method)] = np.percentile(data, 95)
return(df)

for j in range(iter_nr):
# Generate random data
rnd_data = np.random.lognormal(mean=mu_th, sigma=sigma_th, size=N_obs)

# Random distribution values
df_results = calculate_and_update_metrics(rnd_data, df_results, j, 'Original')

# Censor data
CL = np.quantile(rnd_data, Q)
cen_data = rnd_data.copy()
cen_data[cen_data <= CL] = CL

# Removal
rem_data = rnd_data.copy()
rem_data = rem_data[rnd_data > CL]
df_results = calculate_and_update_metrics(rem_data, df_results, j, 'Removal')

# Substitution
subs_data = rnd_data.copy()
subs_data[subs_data<=CL] = CL/2
df_results = calculate_and_update_metrics(subs_data, df_results, j,
'Substitution')

# MCMC
with pm.Model() as censored_model:
# Define prior distribution
mu = pm.Normal("mu", mu = np.mean(np.log(cen_data)), sigma = 1)
sigma = pm.LogNormal("sigma", mu = np.std(np.log(cen_data)), sigma = 1)
# Assume sample distribution
x = pm.LogNormal.dist(mu=mu, sigma=sigma)
x_censored = pm.Censored("obs", x, lower=CL, upper=None, observed=cen_data)
# Sampling
trace = pm.sample(1000, chains = 2, return_inferencedata=True)

# Posterior predictive sampling
uncensored = pm.LogNormal('uncensored', mu, sigma)
pp = pm.sample_posterior_predictive(trace, var_names=['uncensored'], predictions=True)
pp_data = pp['predictions']['uncensored'].values.flatten()
df_results = calculate_and_update_metrics(pp_data, df_results, j,
'MCMC')

# Calculate MSE and Bias
df_MSE = pd.DataFrame(index = metrics, columns = methods)
df_bias = pd.DataFrame(index = metrics, columns = methods)

for j in metrics:
for k in methods:
MSE = np.mean((df_results[j,k].values-df_th[j].values)**2)
bias = np.mean(df_results[j,k].values-df_th[j].values)

df_MSE.loc[j, k] = MSE
df_bias.loc[j, k] = bias
``````

Just to confirm, I assume you’re testing `cen_data` vs `rem_data` vs `subs_data` it different runs, but you’re only showing the first one in the script above?

Not sure I understand your question. In this specific example I’m generating 20 different random samples (from the same distribution) with 150 elements per sample and then comparing the results of the different methods. I’ve run the script with up to 200 samples but didn’t observe significant changes.

You have some variables `rem_data` and `subs_data` that are not used in the model, what are they for?

Those are just for comparison and not part of the MCMC model. I am calculating what I would get if I was removing all the censored data (`rem_data`) and if I replace the censored data with CL/2 (`subs_data`, this is the substitution method). The idea is to check how better can the MCMC model be compared to those two other approaches which are known to be bad but still used.

For reference this is the MSE I get on a run with 200 samples, n = 150, censored up to the 80th percentile. MCMC provides better results on some metrics like sigma and worse on others like the median, but it’s not always like this. And if I reduce the sample size MCMC tends to perform worse on all metrics.

Removal Substitution MCMC
mu 1.250 0.049 0.023
sigma 0.190 0.100 0.021
mean 79.381 0.498 0.393
q25 60.969 3.301 0.264
median 66.212 0.183 0.372
q95 164.214 5.737 6.667

MCMC seems to be doing pretty well, always better than removal. The substitution is a strange manipulation, that certainly biases your estimates, but may just be that it biases it in a more favorable direction for those two metrics (median, and q95) where it comes out better (and worse for the remaining ones).

I don’t find anything too surprising?

Also, if you have very few datapoints, your will basically measure prior bias.

What kind of results where you expecting and why?

1 Like

Whether or not setting censored data to threshold/2 works well or not depends on where the mean of the data is wrt to threshold and what the is sd is I guess. Try cases where you censored below a higher quantile (0.99) or a lower quantile (0.01) and then the optimal rule of thumb should change. In fact if you only censor below say 0.01 quantile, then not applying the rule of thumb will likely give better results. This is why you need the censored, so that you don’t have to use “data specific” rule of thumbs.

2 Likes

As mentioned before my reference is a paper where they used STAN with a similar dataset (also lognormally distributed), in their case MCMC results at 80% censoring are about two orders of magnitude better for the median (GM) and the geometric standard deviation (GSD) when compared against the substitution method (RL/2 in the figure below). That’s why I was expecting better performances and suspecting an issue in my model.

So I think I would roll back to a simpler comparison first and just check the following:

For censoring across different quantiles say [0.01, 0.1, 0.5, 0.9, 0.99] estimate mu parameter via Censored and also do a non-parametric estimate of mu using censored data and data where you apply a rule of thumb. If everything is setup correctly, then mu should pretty much agree on 0.01
and deteriorate as q increases. As long as you have enough data (say 50+) I dont think you need to do multiple runs you can just run this once but across different q values. This is exactly what happens for a simple normal model for instance (you can even see where the rule of thumb likely switches from under estimating to over estimating):

``````import pymc as pm
import numpy as np
import arviz as az
import matplotlib.pyplot as plt

mu1 = 0
ndata = 100
data1 = np.random.normal(mu1, 1, size=ndata)

mu_vals = []

for q in [0.001, 0.1, 0.5, 0.9]:

lower = np.quantile(data1, q=q)

cdata1 = data1.copy()
cdata1[data1<lower] = lower

with pm.Model() as model:

mu = pm.Normal("mu", 0, 5)
sd = pm.InverseGamma("sd", 1)

dist = pm.Normal.dist(mu, sd)
pm.Censored("obs", dist=dist, lower=lower,
upper=np.inf, observed=cdata1)

idata = pm.sample( draws=1000)

mu_bayesian_est = az.summary(idata, var_names="mu").iloc[0,0]

mu_nonparam_est = np.mean(cdata1)

cdata1[cdata1==lower] = lower - 1
mu_mod_nonparam_est = np.mean(cdata1)

mu_vals.append([mu_bayesian_est, mu_nonparam_est, mu_mod_nonparam_est])

mu_vals = np.array(mu_vals).T
fig,ax = plt.subplots(1, 1, figsize=(5,5))

sim_mean = np.mean(data1)
ax.plot(0, sim_mean,
alpha=1, label="simulated uncensored mean",
color="black",zorder=1, marker='x')
ax.plot(mu_vals[0,:], label="bayesian", zorder=0, marker='o')
ax.plot(mu_vals[1,:], label="nonparametric", zorder=0, marker='o')
ax.plot(mu_vals[2,:], label="modified -1 nonparametric", zorder=0, marker='o')

ax.set_xticks(range(4))
ax.set_xticklabels([0.001, 0.1, 0.5, 0.9], rotation=90)
ax.set_xlabel("lower censoring quantile")
ax.set_ylabel("mu estimate")
``````

And the plot I get is

So throw away as much of the code as you can to strip it down to its essentials, compare only one parameter to start with and then add back complexity step by step until you reach somewhere where your expectations break.

3 Likes

Thanks a lot, I’ll try what you suggested and write back if I have questions or something to add.

1 Like

A short update in case you are curious: after trying a few things following what suggested by @iavicenna I decided to also run the STAN model from the paper to directly compare the results. STAN and PyMC produce comparable results and are clearly better than the substitution method.
I think in the paper they showed a larger discrepancy between MCMC and the substitution method because the distribution they used had a much longer tail (their kurtosis was around 3e4 while in the example I reported above was only around 30) and with longer tails the substitution method is clearly less reliable (my fault for not checking this earlier).

Anyway this was a good exercise and a good confirmation that it was not an issue of my simple PyMC model.

Thanks again for the time and the useful comments!

1 Like