Modelling censored data in pymc v4: a simple Gaussian random effects example

I would like to model censored data in pymc v4. I am aware of the pm.Censored() functionality; however, this doesn’t work for me for two reasons:

  1. I am working with multivariate data, and pm.Censored() only works for univariate distributions.
  2. Each of my data points are censored at different levels.

So, I am using Scipy library which helps with issue 1., because it can evaluate the logcdf of a multivariate distribution, and using pm.Potential() helps with issue 2.

However, the model seems to have issues estimating the parameters. I have done the usual checks and also:

  • Finding the MLE’s for the equivalent frequentist model to check if the MLE’s are what I expect.
  • Checking the prior predictive.
import arviz as az
import numpy as np
import pandas as pd

import pymc as pm

print(pm.__version__)

df = pd.read_csv("MRE.csv")

## construct (censored) likelihood function for each individual
def censored_log_likelihood_name(standardized, uncensored):
        
    llh = pm.math.sum(pm.logp(pm.Normal.dist(0, 1), standardized[np.array(uncensored)]))

    if np.sum(uncensored) != len(uncensored): ## if some times are censored
    
        left_censored_name = (1-uncensored).astype(bool)        
        llh += pm.math.sum(pm.logcdf(pm.Normal.dist(0,1), standardized[np.array(left_censored_name)]))

    return llh


def construct_model(df, fix_sigma=False, fix_mu = False):

    IDs_idx, IDs = pd.factorize(df.full_name_computed)
    coords = {
        "IDs":IDs
    }

    ## starting values if required
    sigma_in = df.groupby("full_name_computed").Gaus.var().mean()
    mu_in = df.groupby("full_name_computed").Gaus.mean()[IDs].values

    with pm.Model(coords=coords) as model:
        
        if fix_mu:
            mu = mu_in
        else:
            a = pm.Normal("a", 2)
            b = pm.Exponential("b", 1)
            mu = pm.Normal("mu", a,b, dims = ["IDs"], initval = mu_in)

        if fixed_sigma:
            sigma = sigma_in
        else:
            tau = pm.Gamma("tau", alpha = 10, beta = 5)
            sigma = pm.Deterministic("sigma",1/tau**2)
        
        llh_name = []
        for i, id_ in enumerate(IDs):
            df_id = df[df.full_name_computed == id_]

            standardized = (df_id.Gaus_censored.values - mu[i])/sigma ## standardize values to be Normal(0,1)
            uncensored = df_id.uncensored

            llh_name.append(censored_log_likelihood_name(standardized, uncensored))
                
        llh = pm.Potential("llh", pm.math.sum(llh_name))

    return model

The problem seems to be with estimating sigma, the common variance for all individuals. I know this to be (empirically) df.groupby("full_name_computed").Gaus.var().mean() = 0.664, however, when I run the above model, even fixing all the mu parameters like so

model_fixmu = construct_model(df, False, True)
pm.model_to_graphviz(model_fixmu)
with model_fixmu:
    model_fixmu_idata = pm.sample(chains=1, tune = 1000, draws = 1000)

The posterior estimate for sigma is way different to that of both the prior and likelihood. See the posterior estimate below:

Whereas the prior for sigma is given as:
image
And the MLE and empirical estimates of sigma are 0.680 and 0.664, respectively.

Can anyone explain how the model’s posterior estimate of sigma is so poor, and how I can improve it?

Note: if I fix sigma at the MLE using model_fixsigma = construct_model(df,True, False), the posterior estimates of mu are bang on.

Many thanks,
Harry

p.s., I am aware there are ways I could simplify the model; however, for more complex models I will need this format.
MRE.csv (54.7 KB)