I would like to model censored data in pymc v4. I am aware of the pm.Censored() functionality; however, this doesn’t work for me for two reasons:
- I am working with multivariate data, and pm.Censored() only works for univariate distributions.
- Each of my data points is censored at a different level.
So, I am using the SciPy library, which helps with issue 1 because it can evaluate the logcdf of a multivariate distribution, and pm.Potential(), which helps with issue 2.
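For example, this is what I mean by point 1 (the mean, covariance, and censoring point below are just made up to illustrate):

```python
## toy illustration: SciPy evaluating a multivariate normal log-CDF
import numpy as np
from scipy.stats import multivariate_normal

mean = np.zeros(2)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
censor_point = np.array([-0.3, 0.1])  ## made-up per-dimension censoring levels

## log P(X1 <= -0.3, X2 <= 0.1) under MVN(mean, cov)
print(multivariate_normal(mean=mean, cov=cov).logcdf(censor_point))
```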
However, the model seems to have issues estimating the parameters. I have done the usual checks, and also:
- Finding the MLEs for the equivalent frequentist model to check that they are what I expect.
- Checking the prior predictive (rough sketch below).
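The prior check was along these lines (a rough sketch; it uses the construct_model() function and df defined further down):

```python
## rough sketch of the prior check, using construct_model() from below
with construct_model(df):
    ## note: the pm.Potential term is ignored during prior predictive sampling
    prior_idata = pm.sample_prior_predictive()

## implied priors for the parameters of interest
az.summary(prior_idata.prior, var_names=["sigma", "mu"])
```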
```python
import arviz as az
import numpy as np
import pandas as pd
import pymc as pm

print(pm.__version__)

df = pd.read_csv("MRE.csv")

## construct (censored) likelihood function for each individual
def censored_log_likelihood_name(standardized, uncensored):
    ## make sure the censoring indicator is a boolean mask
    uncensored = np.asarray(uncensored).astype(bool)
    ## uncensored observations contribute the standard-normal log-density
    llh = pm.math.sum(pm.logp(pm.Normal.dist(0, 1), standardized[uncensored]))
    if np.sum(uncensored) != len(uncensored):  ## if some times are censored
        left_censored_name = ~uncensored
        ## left-censored observations contribute the log-CDF at their censoring level
        llh += pm.math.sum(pm.logcdf(pm.Normal.dist(0, 1), standardized[left_censored_name]))
    return llh

def construct_model(df, fix_sigma=False, fix_mu=False):
    IDs_idx, IDs = pd.factorize(df.full_name_computed)
    coords = {"IDs": IDs}

    ## starting values if required
    sigma_in = df.groupby("full_name_computed").Gaus.var().mean()
    mu_in = df.groupby("full_name_computed").Gaus.mean()[IDs].values

    with pm.Model(coords=coords) as model:
        if fix_mu:
            mu = mu_in
        else:
            a = pm.Normal("a", 2)
            b = pm.Exponential("b", 1)
            mu = pm.Normal("mu", a, b, dims=["IDs"], initval=mu_in)
        if fix_sigma:
            sigma = sigma_in
        else:
            tau = pm.Gamma("tau", alpha=10, beta=5)
            sigma = pm.Deterministic("sigma", 1 / tau**2)

        ## accumulate the censored log-likelihood of each individual
        llh_name = []
        for i, id_ in enumerate(IDs):
            df_id = df[df.full_name_computed == id_]
            standardized = (df_id.Gaus_censored.values - mu[i]) / sigma  ## standardize values to be Normal(0,1)
            uncensored = df_id.uncensored
            llh_name.append(censored_log_likelihood_name(standardized, uncensored))
        llh = pm.Potential("llh", pm.math.sum(llh_name))
    return model
```
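For what it's worth, I also sanity-checked the likelihood function itself against SciPy on a toy example (the standardized values and censoring mask below are made up, not from the MRE data):

```python
## toy check: the pm.logp / pm.logcdf terms agree with SciPy's univariate normal
from scipy.stats import norm

z = np.array([-0.5, 0.2, 1.3])              ## made-up standardized values
uncensored = np.array([True, False, True])  ## made-up censoring mask

llh_pm = censored_log_likelihood_name(z, uncensored).eval()
llh_sp = norm.logpdf(z[uncensored]).sum() + norm.logcdf(z[~uncensored]).sum()
print(llh_pm, llh_sp)  ## these should match to numerical precision
```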
The problem seems to be with estimating `sigma`, the common variance for all individuals. I know this to be (empirically) `df.groupby("full_name_computed").Gaus.var().mean() = 0.664`. However, I run the above model, even fixing all the `mu` parameters, like so:

```python
model_fixmu = construct_model(df, False, True)
pm.model_to_graphviz(model_fixmu)

with model_fixmu:
    model_fixmu_idata = pm.sample(chains=1, tune=1000, draws=1000)
```
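and then look at the posterior for `sigma` against the empirical value, roughly like this:

```python
## posterior of sigma, with the empirical estimate (0.664) as a reference line
az.plot_posterior(model_fixmu_idata, var_names=["sigma"], ref_val=0.664)
az.summary(model_fixmu_idata, var_names=["tau", "sigma"])
```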
The posterior estimate for `sigma` is way different to that of both the prior and the likelihood. See the posterior estimate below:

Whereas the prior for `sigma` is given as:

And the MLE and empirical estimates of `sigma` are 0.680 and 0.664, respectively.
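(For reference, the prior on `sigma` shown above is just the pushforward of the `Gamma(10, 5)` prior on `tau`; a quick way to draw from it, not necessarily how the plot was made:)

```python
## draws from the implied prior sigma = 1 / tau**2, with tau ~ Gamma(10, 5)
tau_draws = pm.draw(pm.Gamma.dist(alpha=10, beta=5), draws=10_000)
sigma_draws = 1 / tau_draws**2
az.plot_dist(sigma_draws)
```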
Can anyone explain why the model's posterior estimate of `sigma` is so poor, and how I can improve it?
Note: if I fix `sigma` at the MLE using `model_fixsigma = construct_model(df, True, False)`, the posterior estimates of `mu` are bang on.
Many thanks,
Harry
p.s., I am aware there are ways I could simplify the model; however, for more complex models I will need this format.
MRE.csv (54.7 KB)