Hello all,
I’ve been working on analyzing censored data using PyMC (version 5.10.4) and already asked a question a few days ago. I now have a working model that uses pm.Censored
and sample_posterior_predictive
to reconstruct the uncensored distribution. However, I am encountering some issues with the quality of reconstructed uncensored distributions.
This is the process I am following:
- I generate random samples from a lognormal distribution.
- Set an arbitrary censoring limit (e.g. at the median of the distribution) and censor all data below this limit.
- Run the MCMC model using
pm.Censored
to reconstruct the distribution from the posterior. - Calculate key parameters (mu, sigma, median, etc.)
- Repeat the process multiple times and assess the results using MSE and bias.
The results are comparable to those obtained by the substitution method (setting all censored values to the censoring limit/2), which is generally considered suboptimal for such analyses. For your information, I started this because I wanted to have a Python version of the method developed in STAN by Suzuki et al., 2020. They also worked with log-normally distributed data and showed, for example, that with a censoring limit set to the 80th percentile, MCMC would yield an MSE two orders of magnitude better than the substitution method.
Here are my questions:
- Is there an error in how I’m setting up or running my model?
- Are there specific parameters or adjustments within PyMC that could potentially improve these outcomes?
I’ve attached a minimal reproducible example below for reference. Any insights or suggestions would be greatly appreciated.
import numpy as np
import pymc as pm
import pandas as pd
import scipy.stats as stats
# Define input for random data generation
N_obs = 150 # Number of observations per sample
mu_th = 1.5
sigma_th = 0.8
Q = 0.8 # Quantile to which the censoring limit is set
iter_nr = 20 # Number of samples
# Create a MultiIndex df to store the metrics from the varius samples
metrics = ['mu', 'sigma', 'mean', 'q25', 'median', 'q95']
methods = ['Original', 'Removal', 'Substitution', 'MCMC']
multi_index = pd.MultiIndex.from_product([metrics, methods], names=['Metric', 'Method'])
# Initialize the combined DataFrame
df_results = pd.DataFrame(index=range(iter_nr), columns=multi_index)
# Calculate the theoretical values of the other metrics
q25_th, med_th, q95_th = stats.lognorm.ppf( [.25,.5, .95],
scale = np.exp(mu_th), s = sigma_th)
mean_th = stats.lognorm.stats(scale = np.exp(mu_th), s = sigma_th, moments='m')
df_th = pd.DataFrame([[mu_th, sigma_th, mean_th, q25_th, med_th, q95_th]], columns = metrics)
# Define a function to calculate selected metrics from a sample and update df
def calculate_and_update_metrics(data, df, iteration, method):
"""
Calculate metrics for each dataset and update the DataFrame.
"""
df.loc[iteration, ('mu', method)] = np.mean(np.log(data))
df.loc[iteration, ('sigma', method)] = np.std(np.log(data))
df.loc[iteration, ('mean', method)] = np.mean(data)
df.loc[iteration, ('q25', method)] = np.percentile(data, 25)
df.loc[iteration, ('median', method)] = np.median(data)
df.loc[iteration, ('q95', method)] = np.percentile(data, 95)
return(df)
for j in range(iter_nr):
# Generate random data
rnd_data = np.random.lognormal(mean=mu_th, sigma=sigma_th, size=N_obs)
# Random distribution values
df_results = calculate_and_update_metrics(rnd_data, df_results, j, 'Original')
# Censor data
CL = np.quantile(rnd_data, Q)
cen_data = rnd_data.copy()
cen_data[cen_data <= CL] = CL
# Removal
rem_data = rnd_data.copy()
rem_data = rem_data[rnd_data > CL]
df_results = calculate_and_update_metrics(rem_data, df_results, j, 'Removal')
# Substitution
subs_data = rnd_data.copy()
subs_data[subs_data<=CL] = CL/2
df_results = calculate_and_update_metrics(subs_data, df_results, j,
'Substitution')
# MCMC
with pm.Model() as censored_model:
# Define prior distribution
mu = pm.Normal("mu", mu = np.mean(np.log(cen_data)), sigma = 1)
sigma = pm.LogNormal("sigma", mu = np.std(np.log(cen_data)), sigma = 1)
# Assume sample distribution
x = pm.LogNormal.dist(mu=mu, sigma=sigma)
x_censored = pm.Censored("obs", x, lower=CL, upper=None, observed=cen_data)
# Sampling
trace = pm.sample(1000, chains = 2, return_inferencedata=True)
# Posterior predictive sampling
uncensored = pm.LogNormal('uncensored', mu, sigma)
pp = pm.sample_posterior_predictive(trace, var_names=['uncensored'], predictions=True)
pp_data = pp['predictions']['uncensored'].values.flatten()
df_results = calculate_and_update_metrics(pp_data, df_results, j,
'MCMC')
# Calculate MSE and Bias
df_MSE = pd.DataFrame(index = metrics, columns = methods)
df_bias = pd.DataFrame(index = metrics, columns = methods)
for j in metrics:
for k in methods:
MSE = np.mean((df_results[j,k].values-df_th[j].values)**2)
bias = np.mean(df_results[j,k].values-df_th[j].values)
df_MSE.loc[j, k] = MSE
df_bias.loc[j, k] = bias
Thanks for your help!