Issue with the accuracy of reconstructed data for censored data analysis

Hello all,

I’ve been working on analyzing censored data using PyMC (version 5.10.4) and already asked a question a few days ago. I now have a working model that uses pm.Censored and sample_posterior_predictive to reconstruct the uncensored distribution. However, I am encountering some issues with the quality of the reconstructed uncensored distributions.

This is the process I am following:

  • I generate random samples from a lognormal distribution.
  • Set an arbitrary censoring limit (e.g. at the median of the distribution) and censor all data below this limit.
  • Run the MCMC model using pm.Censored to reconstruct the distribution from the posterior.
  • Calculate key parameters (mu, sigma, median, etc.).
  • Repeat the process multiple times and assess the results using MSE and bias.

The results are comparable to those obtained by the substitution method (setting all censored values to the censoring limit/2), which is generally considered suboptimal for such analyses. For your information, I started this because I wanted to have a Python version of the method developed in STAN by Suzuki et al., 2020. They also worked with log-normally distributed data and showed, for example, that with a censoring limit set to the 80th percentile, MCMC would yield an MSE two orders of magnitude better than the substitution method.

Here are my questions:

  • Is there an error in how I’m setting up or running my model?
  • Are there specific parameters or adjustments within PyMC that could potentially improve these outcomes?

I’ve attached a minimal reproducible example below for reference. Any insights or suggestions would be greatly appreciated.

import numpy as np
import pymc as pm
import pandas as pd
import scipy.stats as stats

# Define input for random data generation
N_obs = 150 # Number of observations per sample
mu_th = 1.5
sigma_th = 0.8
Q = 0.8 # Quantile to which the censoring limit is set
iter_nr = 20 # Number of samples

# Create a MultiIndex df to store the metrics from the various samples
metrics = ['mu', 'sigma', 'mean', 'q25', 'median', 'q95']
methods = ['Original', 'Removal', 'Substitution', 'MCMC']
multi_index = pd.MultiIndex.from_product([metrics, methods], names=['Metric', 'Method'])
# Initialize the combined DataFrame
df_results = pd.DataFrame(index=range(iter_nr), columns=multi_index)

# Calculate the theoretical values of the other metrics
q25_th, med_th, q95_th = stats.lognorm.ppf( [.25,.5, .95],
                                           scale = np.exp(mu_th), s = sigma_th)
mean_th = stats.lognorm.stats(scale = np.exp(mu_th), s = sigma_th, moments='m')
df_th = pd.DataFrame([[mu_th, sigma_th, mean_th, q25_th, med_th, q95_th]], columns = metrics)

# Define a function to calculate selected metrics from a sample and update df
def calculate_and_update_metrics(data, df, iteration, method):
    """
    Calculate selected metrics from a dataset and update the DataFrame.
    """
    df.loc[iteration, ('mu', method)] = np.mean(np.log(data))
    df.loc[iteration, ('sigma', method)] = np.std(np.log(data))
    df.loc[iteration, ('mean', method)] = np.mean(data)
    df.loc[iteration, ('q25', method)] = np.percentile(data, 25)
    df.loc[iteration, ('median', method)] = np.median(data)
    df.loc[iteration, ('q95', method)] = np.percentile(data, 95)
    return df

for j in range(iter_nr):
  # Generate random data
  rnd_data = np.random.lognormal(mean=mu_th, sigma=sigma_th, size=N_obs)

  # Random distribution values
  df_results = calculate_and_update_metrics(rnd_data, df_results, j, 'Original')

  # Censor data
  CL = np.quantile(rnd_data, Q)
  cen_data = rnd_data.copy()
  cen_data[cen_data <= CL] = CL

  # Removal
  rem_data = rnd_data.copy()
  rem_data = rem_data[rnd_data > CL]
  df_results = calculate_and_update_metrics(rem_data, df_results, j, 'Removal')

  # Substitution
  subs_data = rnd_data.copy()
  subs_data[subs_data<=CL] = CL/2
  df_results = calculate_and_update_metrics(subs_data, df_results, j,
                                            'Substitution')

  # MCMC
  with pm.Model() as censored_model:
    # Define prior distributions
    mu = pm.Normal("mu", mu = np.mean(np.log(cen_data)), sigma = 1)
    sigma = pm.LogNormal("sigma", mu = np.std(np.log(cen_data)), sigma = 1)
    # Assume sample distribution
    x = pm.LogNormal.dist(mu=mu, sigma=sigma)
    x_censored = pm.Censored("obs", x, lower=CL, upper=None, observed=cen_data)
    # Sampling
    trace = pm.sample(1000, chains = 2, return_inferencedata=True)

    # Posterior predictive sampling
    uncensored = pm.LogNormal('uncensored', mu, sigma)
    pp = pm.sample_posterior_predictive(trace, var_names=['uncensored'], predictions=True)
    pp_data = pp['predictions']['uncensored'].values.flatten()
    df_results = calculate_and_update_metrics(pp_data, df_results, j,
                                            'MCMC')

# Calculate MSE and Bias
df_MSE = pd.DataFrame(index = metrics, columns = methods)
df_bias = pd.DataFrame(index = metrics, columns = methods)

for j in metrics:
  for k in methods:
    MSE = np.mean((df_results[j,k].values-df_th[j].values)**2)
    bias = np.mean(df_results[j,k].values-df_th[j].values)

    df_MSE.loc[j, k] = MSE
    df_bias.loc[j, k] = bias

Thanks for your help!

Just to confirm, I assume you’re testing cen_data vs rem_data vs subs_data in different runs, but you’re only showing the first one in the script above?

Not sure I understand your question. In this specific example I’m generating 20 different random samples (from the same distribution) with 150 elements per sample and then comparing the results of the different methods. I’ve run the script with up to 200 samples but didn’t observe significant changes.

You have some variables rem_data and subs_data that are not used in the model, what are they for?

Those are just for comparison and are not part of the MCMC model. I am calculating what I would get if I removed all the censored data (rem_data) and if I replaced the censored data with CL/2 (subs_data; this is the substitution method). The idea is to check how much better the MCMC model can be compared to those two other approaches, which are known to be poor but are still commonly used.

For reference, this is the MSE I get on a run with 200 samples, n = 150, censored up to the 80th percentile. MCMC provides better results on some metrics (like sigma) and worse on others (like the median), but it’s not always like this. And if I reduce the sample size, MCMC tends to perform worse on all metrics.

Metric      Removal   Substitution     MCMC
mu            1.250          0.049    0.023
sigma         0.190          0.100    0.021
mean         79.381          0.498    0.393
q25          60.969          3.301    0.264
median       66.212          0.183    0.372
q95         164.214          5.737    6.667

MCMC seems to be doing pretty well, always better than removal. The substitution is a strange manipulation that certainly biases your estimates, but it may just be that it biases them in a more favorable direction for the two metrics (median and q95) where it comes out better (and in a less favorable direction for the remaining ones).
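As a quick sanity check (just reusing the mu_th, sigma_th and Q values from your own script, nothing from the paper), you can see why the substituted median can land close to the truth in this particular setup:

import numpy as np
import scipy.stats as stats

# Once 80% of the sample is replaced by CL/2, the empirical median of the substituted
# data is simply CL/2, and for this (Q, sigma) combination CL/2 happens to sit very
# close to the true median, so substitution looks artificially good on that metric.
mu_th, sigma_th, Q = 1.5, 0.8, 0.8
CL = stats.lognorm.ppf(Q, s=sigma_th, scale=np.exp(mu_th))  # theoretical 80th percentile
print(f"CL/2 = {CL / 2:.3f}  vs  true median = {np.exp(mu_th):.3f}")
# CL/2 ≈ 4.39 vs median ≈ 4.48; and q95 is untouched by substitution altogether,
# since the top 20% of the sample is never replaced.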

I don’t find anything too surprising?

Also, if you have very few datapoints, you will basically measure prior bias.
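For illustration, here is a minimal sketch (my own toy setup with made-up priors, not your exact model): with only a handful of heavily censored lognormal observations, the posterior for mu will typically be pulled towards the prior mean and away from the true value of 1.5.

import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
small_data = rng.lognormal(mean=1.5, sigma=0.8, size=10)   # only 10 observations
cl = np.quantile(small_data, 0.8)                          # censor the lowest 80%
small_cens = np.where(small_data <= cl, cl, small_data)

with pm.Model():
    # prior deliberately centred away from the true mu of 1.5
    mu = pm.Normal("mu", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    dist = pm.LogNormal.dist(mu=mu, sigma=sigma)
    pm.Censored("obs", dist, lower=cl, upper=None, observed=small_cens)
    idata = pm.sample(1000, chains=2)

# With so little information in the data, the posterior for mu stays noticeably
# shrunk towards the prior rather than recovering the true value.
print(az.summary(idata, var_names=["mu"]))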

What kind of results were you expecting and why?

1 Like

Whether or not setting censored data to threshold/2 works well depends on where the mean of the data is with respect to the threshold and on what the sd is, I guess. Try cases where you censor below a higher quantile (0.99) or a lower quantile (0.01), and the optimal rule of thumb should change. In fact, if you only censor below, say, the 0.01 quantile, then not applying the rule of thumb will likely give better results. This is why you need the censored likelihood (pm.Censored), so that you don’t have to use data-specific rules of thumb.

2 Likes

Thanks for the answers!
As mentioned before, my reference is a paper where they used STAN with a similar dataset (also lognormally distributed); in their case the MCMC results at 80% censoring are about two orders of magnitude better for the median (GM) and the geometric standard deviation (GSD) when compared against the substitution method (RL/2 in the figure below). That’s why I was expecting better performance and suspected an issue in my model.
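For anyone reading along, the paper’s metrics map directly onto the parameters in my script (this is just the standard lognormal relationship, not anything specific to the paper):

import numpy as np

# For a lognormal, the geometric mean (GM) equals exp(mu), i.e. the median,
# and the geometric standard deviation (GSD) equals exp(sigma), so comparing
# GM/GSD is equivalent to comparing the mu/sigma metrics used in my script.
mu_th, sigma_th = 1.5, 0.8
GM = np.exp(mu_th)      # ≈ 4.48, the median of the lognormal
GSD = np.exp(sigma_th)  # ≈ 2.23
print(GM, GSD)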

So I think I would roll back to a simpler comparison first and just check the following:

For censoring across different quantiles, say [0.01, 0.1, 0.5, 0.9, 0.99], estimate the mu parameter via pm.Censored and also compute a non-parametric estimate of mu from the censored data and from data where you apply the rule of thumb. If everything is set up correctly, then the mu estimates should pretty much agree at 0.01 and deteriorate as q increases. As long as you have enough data (say 50+), I don’t think you need to do multiple runs; you can just run this once across different q values. This is exactly what happens for a simple normal model, for instance (you can even see where the rule of thumb likely switches from underestimating to overestimating):

import pymc as pm
import numpy as np
import arviz as az
import matplotlib.pyplot as plt

mu1 = 0
ndata = 100
data1 = np.random.normal(mu1, 1, size=ndata)

mu_vals = []

for q in [0.001, 0.1, 0.5, 0.9]:

  lower = np.quantile(data1, q=q)

  cdata1 = data1.copy()
  cdata1[data1<lower] = lower

  with pm.Model() as model:

    mu = pm.Normal("mu", 0, 5)
    sd = pm.InverseGamma("sd", 1)

    dist = pm.Normal.dist(mu, sd)
    pm.Censored("obs", dist=dist, lower=lower,
                upper=np.inf, observed=cdata1)

    idata = pm.sample( draws=1000)

  # posterior mean of mu from the censored model
  mu_bayesian_est = az.summary(idata, var_names="mu").iloc[0,0]

  # naive nonparametric estimate: censored values kept at the limit
  mu_nonparam_est = np.mean(cdata1)

  # "rule of thumb" style modification: push censored values below the limit
  cdata1[cdata1==lower] = lower - 1
  mu_mod_nonparam_est = np.mean(cdata1)

  mu_vals.append([mu_bayesian_est, mu_nonparam_est, mu_mod_nonparam_est])


mu_vals = np.array(mu_vals).T
fig,ax = plt.subplots(1, 1, figsize=(5,5))

sim_mean = np.mean(data1)
ax.plot(0, sim_mean,
        alpha=1, label="simulated uncensored mean",
        color="black",zorder=1, marker='x')
ax.plot(mu_vals[0,:], label="bayesian", zorder=0, marker='o')
ax.plot(mu_vals[1,:], label="nonparametric", zorder=0, marker='o')
ax.plot(mu_vals[2,:], label="modified -1 nonparametric", zorder=0, marker='o')

ax.set_xticks(range(4))
ax.set_xticklabels([0.001, 0.1, 0.5, 0.9], rotation=90)
ax.set_xlabel("lower censoring quantile")
ax.set_ylabel("mu estimate")

And the plot I get is:

[compareq: mu estimate vs lower censoring quantile for the Bayesian, nonparametric, and modified nonparametric estimates]

So throw away as much of the code as you can to strip it down to its essentials, compare only one parameter to start with, and then add back complexity step by step until you reach the point where your expectations break.

3 Likes

Thanks a lot, I’ll try what you suggested and write back if I have questions or something to add.

1 Like

A short update in case you are curious: after trying a few things following what was suggested by @iavicenna, I decided to also run the STAN model from the paper to compare the results directly. STAN and PyMC produce comparable results and are clearly better than the substitution method.
I think the paper showed a larger discrepancy between MCMC and the substitution method because the distribution they used had a much longer tail (their kurtosis was around 3e4, while in the example I reported above it was only around 30), and with longer tails the substitution method is clearly less reliable (my fault for not checking this earlier).
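In case it helps anyone later, here is the tail-weight check I should have done up front (a small sketch using scipy’s lognormal; sigma = 1.6 is just an illustrative value that pushes the kurtosis into the 3e4 range, not the paper’s actual parameter):

import numpy as np
import scipy.stats as stats

# Excess kurtosis of the lognormal as a quick measure of tail weight.
# sigma = 0.8 is the value from my example above; sigma = 1.6 is only illustrative.
for s in [0.8, 1.6]:
    kurt = stats.lognorm.stats(s=s, scale=np.exp(1.5), moments='k')
    print(f"sigma = {s}: excess kurtosis ≈ {float(kurt):.3g}")
# sigma = 0.8 gives ≈ 31 (the ~30 I mentioned); sigma = 1.6 is already in the 3e4
# range, where the CL/2 substitution becomes much less reliable.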

Anyway, this was a good exercise and a good confirmation that it was not an issue with my simple PyMC model.

Thanks again for the time and the useful comments!

1 Like