Help with Censored Regression

Yesterday I uploaded a demo notebook of Truncated Regression (Example notebook for truncated regression) and I am trying to extend this to Censored Regression (aka Tobit Regression). So this is similar to this unanswered question Regression with censored response variable.

I’ve learnt a bit from this notebook https://docs.pymc.io/notebooks/censored_data.html which presents a simple example of estimation of a mean and sd of 1D censored data. It presents a model which imputes the values of censored data…

# Imputed censored model
n_right_censored = len(samples[samples >= high])
n_left_censored = len(samples[samples <= low])
n_observed = len(samples) - n_right_censored - n_left_censored

with pm.Model() as imputed_censored_model:
    mu = pm.Normal('mu', mu=0., sigma=(high - low) / 2.)
    sigma = pm.HalfNormal('sigma', sigma=(high - low) / 2.)

    right_censored = pm.Bound(pm.Normal, lower=high)(
        'right_censored', mu=mu, sigma=sigma, shape=n_right_censored
    )
    left_censored = pm.Bound(pm.Normal, upper=low)(
        'left_censored', mu=mu, sigma=sigma, shape=n_left_censored
    )

    observed = pm.Normal(
        'observed',
        mu=mu,
        sigma=sigma,
        observed=censored,
        shape=n_observed
    )

Although note that the observed data in that model is in fact truncated data (where the points outside the bounds are removed), not censored data.

Presumably for a regression context you could modify this approach to set mu as a function of x?

So referring to the example figure below, would a sensible approach be to:

  • infer slope, intercept, sd from truncated data (ie. data within the censor bounds)
  • Split the censored data up into left and right sets (ie y<-1.5 and y>1.5)
  • Impute their y values as in the example notebook above and code snippet BUT where mu is a function of the x coordinates of the censored data.

Questions

  1. Does this approach sound reasonable?
  2. Can anyone explain if and why we should be using Bound rather than TruncatedNormal in this context of censored data?
  3. In the code example above (https://docs.pymc.io/notebooks/censored_data.html) I can’t work out why the imputation of right_censored and left_censored makes any difference to the estimate of mu and sigma (it does, I’ve checked). Can anyone explain how that works?
2 Likes

Wouldn’t it be a lot simpler to set the censored data to nans and fit a usual regression model, allowing PyMC to impute the missing data?

I guess this loses the information concerning whether a given censored datum crossed the upper or lower bound, but in real situations you might not know this anyway.

I tried this, setting y values outside the bounds to missing, and keeping the x values. You can impute the y values of the missing data


but that doesn’t seem to fix the bias in estimating the slope parameter. I think this is because a regular normal likelihood distribution is not appropriate in this context.

I could plot the data outside the bounds with known x and imputed y, but I was lazy.

I think the way principled way forward is to use a custom likelihood function (see Tobit Regression but find the eq. for both upper and lower bounds. I’ve not done custom likelihoods in PyMC3 before, so we’ll see how that goes.

That makes sense; I hadn’t considered the bias. Good luck, very interested to see if this works.

Initial approach seems a bit ad hoc. I’ve implemented Tobit regression now (with both upper and lower bounds) in the setting of maximum likelihood. So this seems to confirm that things are on the right track.

Used this custom likelihood function, just using scipy.optimize.minimize

import scipy.stats as stats


def tobit_regression_nll(params):
    """Negative log likelihood for Tobit (censored) regression with both 
    upper and lower bounds. Operates on censored data (xc, yc, censor_low,
    censor_high, censored).
    Inputs censor_low, censor_high, censored are all arrays of booleans.
    Inputs cx, cy are the censored observed data.
    """
    c, m, sd = params[0], params[1], params[2]

    negLL = 0

    # lower censored data
    yhat = c + m * xc[censor_low == True]
    negLL = negLL + -np.sum(
        stats.norm.logcdf(yc[censor_low == True], loc=yhat, scale=sd)
    )

    # uncensored data
    yhat = c + m * xc[censored == False]
    negLL = negLL + -np.sum(
        stats.norm.logpdf(yc[censored == False], loc=yhat, scale=sd)
    )

    # upper censored data. logsf = log survival function = log(1 - cdf)
    yhat = c + m * xc[censor_high == True]
    negLL = negLL + -np.sum(
        stats.norm.logsf(yc[censor_high == True], loc=yhat, scale=sd)
    )
    return negLL

This seems to work very well:

Just have to work out how to implement this as a custom likelihood function in PyMC3 now.

3 Likes

logcdf is implemented for Normal distribution in PyMC3: https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/continuous.py#L545

However, it looks like the tobit_regression is the same as using a TruncateNormal likelihood from PyMC3?

I got the impression that a TruncatedNormal is a normal but with zero mass outside the bounds. Which is different from when you have censoring, where you add the mass below the lower bound to the pdf at the lower bound, and the mass above the upper bound at the upper bounds. So the likelihood for censored regression would look like a truncated normal, but with ‘horns’ at the bounds? But maybe I misunderstood the difference between Bounds and TruncatedNormal ?

EDIT: As far as I can make out applying Bounds to a normal is the same as using Truncated normal ?

They are different - Bound cuts out the density outside of the bound which is improper - Truncate Normal actually apply the cdf to the value outside of the bound: https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/continuous.py#L736

So if I’m right, what we are saying is that

  • for truncated regression you need a bounded normal
  • for censored normal you need a truncated normal.
  • and if you use a truncated normal for truncated data, then you are doing it wrong.

Doesn’t this seem a bit non-intuitive? Seems like https://github.com/pymc-devs/pymc3/issues/1864 is relevant here.

Censored regression

This now seems much easier than I thought as you can use Truncated Normal. When I do this with an example with censored data (see https://github.com/drbenvincent/pymc3-demo-code/blob/master/censored_regression_pymc3.ipynb) then it doesn’t seem to work… in that we still have a bias in the inferred slope.


My guess here is that because the data is already censored, perhaps there are numerical precision issues which means the censored data is being treated as uncensored? In which case we should manually push the censored data further into the censored zone to avoid these numerical precision issues? Or maybe it’s something else entirely?

And pointers much appreciated.

EDIT: manually pushing the censored data further into censored zones results in Bad initial energy error, so I guess my hunch was wrong. In which case, I have no idea why using TruncatedNormal still produces a biased estimate.