Yesterday I uploaded a demo notebook of Truncated Regression (Example notebook for truncated regression) and I am now trying to extend this to Censored Regression (aka Tobit Regression). This is similar to an unanswered question (Regression with censored response variable).
I’ve learnt a bit from this notebook https://docs.pymc.io/notebooks/censored_data.html which presents a simple example of estimation of a mean and sd of 1D censored data. It presents a model which imputes the values of censored data…
# Imputed censored model
n_right_censored = len(samples[samples >= high])
n_left_censored = len(samples[samples <= low])
n_observed = len(samples) - n_right_censored - n_left_censored
with pm.Model() as imputed_censored_model:
    mu = pm.Normal('mu', mu=0., sigma=(high - low) / 2.)
    sigma = pm.HalfNormal('sigma', sigma=(high - low) / 2.)
    right_censored = pm.Bound(pm.Normal, lower=high)(
        'right_censored', mu=mu, sigma=sigma, shape=n_right_censored
    )
    left_censored = pm.Bound(pm.Normal, upper=low)(
        'left_censored', mu=mu, sigma=sigma, shape=n_left_censored
    )
    observed = pm.Normal(
        'observed',
        mu=mu,
        sigma=sigma,
        observed=censored,
        shape=n_observed
    )
Although note that the observed data in that model is in fact truncated data (where the points outside the bounds are removed), not censored data.
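To make the truncated/censored distinction concrete, here is a minimal sketch in plain Python. The helper names (`censor`, `truncate`) and the sample values are made up for illustration and do not come from the notebook.

```python
def censor(values, low, high):
    """Clip out-of-range values to the bounds; the number of points is preserved."""
    return [min(max(v, low), high) for v in values]


def truncate(values, low, high):
    """Drop out-of-range values entirely; the number of points shrinks."""
    return [v for v in values if low < v < high]


samples = [-2.3, -0.4, 0.1, 0.9, 2.7]  # made-up latent values
low, high = -1.5, 1.5

censored_values = censor(samples, low, high)
truncated_values = truncate(samples, low, high)

print(censored_values)   # [-1.5, -0.4, 0.1, 0.9, 1.5]
print(truncated_values)  # [-0.4, 0.1, 0.9]
```

In the censored case we still know how many points fell beyond each bound; in the truncated case that information is gone.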
Presumably, in a regression context, you could modify this approach to set `mu` as a function of `x`?
So referring to the example figure below, would a sensible approach be to:
- Infer `slope`, `intercept`, and `sd` from the truncated data (i.e. the data within the censor bounds).
- Split the censored data up into left and right sets (i.e. `y < -1.5` and `y > 1.5`).
- Impute their y values as in the example notebook and code snippet above, BUT where `mu` is a function of the x coordinates of the censored data.
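As a reference point for the steps above, the textbook Tobit log-likelihood can be sketched in plain Python. This is only a sketch under stated assumptions, not code from the notebook: the function names (`tobit_loglik` and the helpers) and the toy data are hypothetical, and points at or beyond the bounds are treated as censored (`<=` / `>=`), following the notebook snippet. Points inside the bounds contribute the usual normal density, while each censored point contributes the probability mass beyond its bound, with `mu` computed from that point's x coordinate.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_logcdf(y, mu, sigma):
    """log P(Y <= y) for Y ~ Normal(mu, sigma)."""
    z = (y - mu) / sigma
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))


def normal_logsf(y, mu, sigma):
    """log P(Y >= y) for Y ~ Normal(mu, sigma)."""
    z = (y - mu) / sigma
    return math.log(0.5 * math.erfc(z / math.sqrt(2.0)))


def tobit_loglik(slope, intercept, sigma, x, y, low, high):
    """Log-likelihood of a linear model with responses censored at low and high."""
    total = 0.0
    for xi, yi in zip(x, y):
        mu = intercept + slope * xi  # mu is a function of x for every point
        if yi <= low:
            total += normal_logcdf(low, mu, sigma)   # left-censored point
        elif yi >= high:
            total += normal_logsf(high, mu, sigma)   # right-censored point
        else:
            total += normal_logpdf(yi, mu, sigma)    # fully observed point
    return total


# Toy data (made up): the first and last responses sit on the censor bounds.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.5, -0.8, 0.1, 1.1, 1.5]

print(tobit_loglik(1.0, 0.0, 0.5, x, y, low=-1.5, high=1.5))
print(tobit_loglik(0.2, 0.0, 0.5, x, y, low=-1.5, high=1.5))
```

Because the censored terms depend on `slope`, `intercept` and `sigma` through `mu`, censored points still carry information about the regression parameters even though their exact y values are unknown.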
Questions
- Does this approach sound reasonable?
- Can anyone explain if and why we should be using `Bound` rather than `TruncatedNormal` in this context of censored data?
- In the code example above (https://docs.pymc.io/notebooks/censored_data.html) I can’t work out why the imputation of `right_censored` and `left_censored` makes any difference to the estimate of `mu` and `sigma` (it does, I’ve checked). Can anyone explain how that works?