Help with Censored Regression

Yesterday I uploaded a demo notebook of Truncated Regression (Example notebook for truncated regression) and I am trying to extend this to Censored Regression (aka Tobit Regression). So this is similar to this unanswered question Regression with censored response variable.

I’ve learnt a bit from this notebook https://docs.pymc.io/notebooks/censored_data.html, which presents a simple example of estimating the mean and sd of 1D censored data. One of its models imputes the values of the censored data:

# Imputed censored model
n_right_censored = len(samples[samples >= high])
n_left_censored = len(samples[samples <= low])
n_observed = len(samples) - n_right_censored - n_left_censored

with pm.Model() as imputed_censored_model:
    mu = pm.Normal('mu', mu=0., sigma=(high - low) / 2.)
    sigma = pm.HalfNormal('sigma', sigma=(high - low) / 2.)

    right_censored = pm.Bound(pm.Normal, lower=high)(
        'right_censored', mu=mu, sigma=sigma, shape=n_right_censored
    )
    left_censored = pm.Bound(pm.Normal, upper=low)(
        'left_censored', mu=mu, sigma=sigma, shape=n_left_censored
    )

    observed = pm.Normal(
        'observed',
        mu=mu,
        sigma=sigma,
        observed=censored,
        shape=n_observed
    )

Although note that the observed data in that model is in fact truncated data (where the points outside the bounds are removed), not censored data.

Presumably for a regression context you could modify this approach to set mu as a function of x?
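Something like this, perhaps (a hedged sketch, not tested; x_left, x_right, x_obs, y_obs, low and high are assumed splits of the data by censoring status, not names from the notebook):

# sketch: the imputed-censored model with mu as a linear function of x
with pm.Model() as imputed_censored_regression:
    m = pm.Normal('m', mu=0., sigma=1.)
    c = pm.Normal('c', mu=0., sigma=1.)
    sigma = pm.HalfNormal('sigma', sigma=1.)

    # imputed y values for the censored points, each with its own mu
    right_censored = pm.Bound(pm.Normal, lower=high)(
        'right_censored', mu=m * x_right + c, sigma=sigma, shape=len(x_right)
    )
    left_censored = pm.Bound(pm.Normal, upper=low)(
        'left_censored', mu=m * x_left + c, sigma=sigma, shape=len(x_left)
    )

    # uncensored observations
    observed = pm.Normal('observed', mu=m * x_obs + c, sigma=sigma, observed=y_obs)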

So referring to the example figure below, would a sensible approach be to:

  • infer slope, intercept, and sd from the truncated data (i.e. data within the censor bounds)
  • split the censored data up into left and right sets (i.e. y < −1.5 and y > 1.5)
  • impute their y values as in the example notebook and code snippet above, BUT with mu as a function of the x coordinates of the censored data.

Questions

  1. Does this approach sound reasonable?
  2. Can anyone explain whether and why we should be using Bound rather than TruncatedNormal in this context of censored data?
  3. In the code example above (https://docs.pymc.io/notebooks/censored_data.html), I can’t work out why the imputation of right_censored and left_censored makes any difference to the estimates of mu and sigma (it does; I’ve checked). Can anyone explain how that works?

Wouldn’t it be a lot simpler to set the censored data to nans and fit a usual regression model, allowing PyMC to impute the missing data?

I guess this loses the information concerning whether a given censored datum crossed the upper or lower bound, but in real situations you might not know this anyway.
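A minimal sketch of what I mean (the variable names are placeholders; I'm assuming clipped data y_censored with predictor x and bounds low/high; PyMC3 treats masked observed entries as missing and imputes them automatically):

import numpy as np
import numpy.ma as ma
import pymc3 as pm

# mask the values sitting at the bounds so PyMC3 imputes them
mask = (y_censored <= low) | (y_censored >= high)
y_masked = ma.masked_array(y_censored, mask=mask)

with pm.Model() as missing_model:
    m = pm.Normal("m", mu=0, sd=1)
    c = pm.Normal("c", mu=0, sd=1)
    σ = pm.HalfNormal("σ", sd=1)
    pm.Normal("y", mu=m * x + c, sd=σ, observed=y_masked)
    trace = pm.sample()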

I tried this, setting the y values outside the bounds to missing while keeping the x values. You can impute the y values of the missing data, but that doesn’t seem to fix the bias in estimating the slope parameter. I think this is because a regular normal likelihood distribution is not appropriate in this context.

I could plot the data outside the bounds with known x and imputed y, but I was lazy.

I think the principled way forward is to use a custom likelihood function (see Tobit Regression), but with the equation extended to cover both upper and lower bounds. I’ve not done custom likelihoods in PyMC3 before, so we’ll see how that goes.
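For reference, my transcription of the standard two-sided Tobit log likelihood, with lower and upper censoring bounds $L$ and $U$, and $\Phi$, $\phi$ the standard normal cdf and pdf:

$$
\log \mathcal{L}(c, m, \sigma) = \sum_{i:\, y_i \le L} \log \Phi\!\left(\frac{L - (c + m x_i)}{\sigma}\right) + \sum_{i:\, L < y_i < U} \log \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - (c + m x_i)}{\sigma}\right) + \sum_{i:\, y_i \ge U} \log\!\left[1 - \Phi\!\left(\frac{U - (c + m x_i)}{\sigma}\right)\right]
$$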

That makes sense; I hadn’t considered the bias. Good luck, very interested to see if this works.

The initial approach was seeming a bit ad hoc, so I’ve now implemented Tobit regression (with both upper and lower bounds) in a maximum likelihood setting. This seems to confirm that things are on the right track.

I used this custom negative log likelihood function, optimized with scipy.optimize.minimize:

import numpy as np
import scipy.stats as stats


def tobit_regression_nll(params):
    """Negative log likelihood for Tobit (censored) regression with both
    upper and lower bounds. Operates on censored data (xc, yc, censor_low,
    censor_high, censored) taken from the enclosing scope.
    Inputs censor_low, censor_high, censored are all arrays of booleans.
    Inputs xc, yc are the censored observed data.
    """
    c, m, sd = params

    negLL = 0.0

    # lower-censored data contribute log P(Y <= lower bound)
    yhat = c + m * xc[censor_low]
    negLL -= np.sum(stats.norm.logcdf(yc[censor_low], loc=yhat, scale=sd))

    # uncensored data contribute the usual log density
    yhat = c + m * xc[~censored]
    negLL -= np.sum(stats.norm.logpdf(yc[~censored], loc=yhat, scale=sd))

    # upper-censored data. logsf = log survival function = log(1 - cdf),
    # i.e. log P(Y >= upper bound)
    yhat = c + m * xc[censor_high]
    negLL -= np.sum(stats.norm.logsf(yc[censor_high], loc=yhat, scale=sd))

    return negLL
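And a hypothetical usage sketch (the starting values are my guesses; Nelder-Mead avoids needing gradients):

from scipy.optimize import minimize

# params are [c, m, sd]; starting values are arbitrary guesses
result = minimize(tobit_regression_nll, x0=np.array([0.0, 1.0, 1.0]),
                  method="Nelder-Mead")
c_hat, m_hat, sd_hat = result.x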

This seems to work very well:

Just have to work out how to implement this as a custom likelihood function in PyMC3 now.


logcdf is implemented for the Normal distribution in PyMC3: https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/continuous.py#L545

However, it looks like this Tobit regression is the same as using a TruncatedNormal likelihood from PyMC3?

I got the impression that a TruncatedNormal is a normal but with zero mass outside the bounds. That is different from censoring, where you add the mass below the lower bound to the pdf at the lower bound, and the mass above the upper bound at the upper bound. So the likelihood for censored regression would look like a truncated normal, but with ‘horns’ at the bounds? But maybe I misunderstood the difference between Bound and TruncatedNormal?

EDIT: As far as I can make out, applying Bound to a normal is the same as using TruncatedNormal?

They are different: Bound cuts out the density outside of the bound, which leaves an improper (unnormalized) distribution, whereas TruncatedNormal actually renormalizes using the cdf at the bounds: https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/continuous.py#L736
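To make that concrete (my summary of the standard definitions): the truncated normal renormalizes the density to the interval $[L, U]$,

$$
f_{\text{trunc}}(y) = \frac{\tfrac{1}{\sigma}\,\phi\!\left(\tfrac{y-\mu}{\sigma}\right)}{\Phi\!\left(\tfrac{U-\mu}{\sigma}\right) - \Phi\!\left(\tfrac{L-\mu}{\sigma}\right)}, \qquad L \le y \le U,
$$

whereas a censored observation model keeps the normal density inside the bounds but piles the outside mass into point masses at the bounds,

$$
P(y = L) = \Phi\!\left(\tfrac{L-\mu}{\sigma}\right), \qquad P(y = U) = 1 - \Phi\!\left(\tfrac{U-\mu}{\sigma}\right),
$$

which are exactly the ‘horns’ described above.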

So if I’m right, what we are saying is that

  • for truncated regression you need a bounded normal
  • for censored regression you need a truncated normal
  • and if you use a truncated normal for truncated data, then you are doing it wrong.

Doesn’t this seem a bit non-intuitive? Seems like https://github.com/pymc-devs/pymc3/issues/1864 is relevant here.

Censored regression

This now seems much easier than I thought, as you can use TruncatedNormal. But when I try this on an example with censored data (see https://github.com/drbenvincent/pymc3-demo-code/blob/master/censored_regression_pymc3.ipynb) it doesn’t seem to work… we still have a bias in the inferred slope.


My guess is that because the data is already censored, perhaps there are numerical precision issues which mean the censored data is being treated as uncensored? In which case, should we manually push the censored data further into the censored zone to avoid these precision issues? Or maybe it’s something else entirely?

Any pointers much appreciated.

EDIT: manually pushing the censored data further into the censored zones results in a Bad initial energy error, so I guess my hunch was wrong. In which case, I have no idea why using TruncatedNormal still produces a biased estimate.

Was this ever resolved? I am now faced with a similar issue.

thank you!
-Steve

I’ve not yet gotten around to fully sorting out correct censored and truncated regression examples, partly because things got confusing, so I moved on to lower-resistance tasks. For example, this PyMC example notebook on Censored Data Models gives an example which is actually truncated data, not censored data. More specifically, this line from the notebook, which supposedly does censoring, actually does truncation:

# Censor samples
censored = samples[(samples > low) & (samples < high)]
# ^ NOPE: this removes the out-of-bounds points, which is truncation

Censoring would actually be:

samples[samples < low] = low
samples[samples > high] = high

That page needs to be fixed… it’s not clear whether the solution is correct for censoring and the data was just accidentally truncated, or whether the solution is correct for truncation but was incorrectly labelled as censoring.

I’m trying to come up with a mini tutorial/demo of truncated and censored regression, but I’m doing it with Turing.jl at the moment.


I think there is some confusion going on about censored vs truncated because the statistical definition of truncation sounds like the common-sense definition of censoring (hiding values beyond a range), whereas statistical censoring matches the common-sense idea of truncation (clipping values above or below an edge to that edge). I am following Stan’s definitions here: Stan User’s Guide

This Discourse thread discusses an example of exponentially truncated data: Help with pm.Potential

And there is one example of censored data in this Notebook (resources/DataAnalysis.ipynb at master · pymc-devs/resources · GitHub ), section 5.5.

Also, we might have an issue going on with both the TruncatedNormal distribution and pm.Bound which affects the gradient. If you think you have the right model and it’s not working, try a non-gradient-based sampler such as Metropolis or Slice: pm.Bound and TruncatedNormal generate wrong gradients · Issue #4417 · pymc-devs/pymc3 · GitHub
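For example, something like this inside an existing model context (just a sketch, assuming a model named model):

with model:
    # Slice doesn't use gradients, so it sidesteps the gradient bug
    trace = pm.sample(step=pm.Slice())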

A tutorial clarifying these things would be really awesome btw!


Bearing in mind that I’m not a mathematician or statistician, I have a very short truncated regression example here, written in Julia with Turing.jl (sorry, I’m cheating on PyMC3).

The model specification is short and, from an end-user perspective, straightforward. I’m just waiting on some feedback from the community on its correctness.

(@ricardoV94 yes, I also work with the definitions in that link to the Stan manual. I think they make the most sense and should be what people use.)


And I agree that we should fix the notebook confusion between censoring and truncation. I will try and review it when I have some time (or if you would like to do it since you have been exploring this topic already, that would be even better).

As promised… a simple worked example of censored regression, although it’s in Turing.jl

Sounds like it could be a good idea, although I am only just getting my head around it all. At some point I will try to translate the truncated and censored regression demos from Turing.jl to PyMC3. I’m more than happy to contribute those examples if it’s just a case of making a pull request with a Jupyter notebook? I’m certainly not averse to contributing to the docs… I have made extensive use of PyMC3 and I’m grateful for all the development that’s gone into it :slight_smile:

Yes, a truncated and censored regression example would be really great. I don’t think we have anything like that in our examples. For the logistics, you would open a PR at GitHub - pymc-devs/pymc-examples: Examples of PyMC3 models, including a library of Jupyter notebooks. and then some kind devs will walk you through if there is any problem.

You are right, what I said previously is not correct:

we cannot just use TruncatedNormal; we need to add the logcdf and logccdf for the censored data to the model logp using pm.Potential.
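A minimal sketch of that idea for the 1D (no-regression) case (my reading of it; assumes clipped data yc with bounds low and high):

import pymc3 as pm
from pymc3.distributions.dist_math import normal_lcdf, normal_lccdf

n_left = int((yc <= low).sum())
n_right = int((yc >= high).sum())
y_uncensored = yc[(yc > low) & (yc < high)]

with pm.Model() as censored_1d:
    mu = pm.Normal("mu", mu=0, sd=5)
    sigma = pm.HalfNormal("sigma", sd=5)
    # uncensored points get the usual normal density
    pm.Normal("uncensored", mu=mu, sd=sigma, observed=y_uncensored)
    # each left-censored point contributes log P(Y <= low),
    # each right-censored point log P(Y >= high)
    pm.Potential("left", n_left * normal_lcdf(mu, sigma, low))
    pm.Potential("right", n_right * normal_lccdf(mu, sigma, high))
    trace = pm.sample()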


Before getting to the censored case, I thought I’d check truncated regression. Here’s a minimum working example, which I believe is doing the right thing. However, we do get Theano warnings when sampling the truncated regression model:

WARNING (theano.tensor.opt): The Op erfcx does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.

Truncated regression minimum working example

import numpy as np
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import pymc3 as pm
import arviz as az

def truncate_y(x, y, bounds):
    keep = (y >= bounds[0]) & (y <= bounds[1])
    return (x[keep], y[keep])

# generate true latent data
m, c, σ, N = 1, 0, 2, 200
x = np.random.uniform(-10, 10, N)
y = np.random.normal(m * x + c, σ)
bounds = [-5, 5]

# generate observed, truncated data
xt, yt = truncate_y(x, y, bounds)

First we’ll run a plain linear regression on the truncated data to confirm that it underestimates the slope:

def linear_regression(x, y):

    with pm.Model() as model:
        m = pm.Normal("m", mu=0, sd=1)
        c = pm.Normal("c", mu=0, sd=1)
        σ = pm.HalfNormal("σ", sd=1)
        y_likelihood = pm.Normal("y_likelihood", mu=m*x+c, sd=σ, observed=y)

    with model:
        trace = pm.sample()

    return model, trace

# run the model on the truncated data (xt, yt)
linear_model, linear_trace = linear_regression(xt, yt)
az.plot_posterior(linear_trace, var_names=['m'], ref_val=m)

Plotting samples from the posterior (code not shown) makes it very visually clear why the bias occurs.

Now we’ll implement the truncated regression model to check it no longer displays this bias.

def truncated_regression(x, y, bounds):

    with pm.Model() as model:
        m = pm.Normal("m", mu=0, sd=1)
        c = pm.Normal("c", mu=0, sd=1)
        σ = pm.HalfNormal("σ", sd=1)

        y_likelihood = pm.TruncatedNormal(
            "y_likelihood",
            mu=m * x + c,
            sd=σ,
            observed=y,
            lower=bounds[0],
            upper=bounds[1],
        )
    
    with model:
        trace = pm.sample()

    return model, trace


# run the model on the truncated data (xt, yt)
truncated_model, truncated_trace = truncated_regression(xt, yt, bounds)
az.plot_posterior(truncated_trace, var_names=['m'], ref_val=m)

and the estimate is good :slight_smile:

Double-checking by plotting from the posterior (code not shown) confirms it’s all good.

Censored regression (first stab… not yet working properly)

I think I’m doing something wrong with the pm.Potential terms. Any ideas? @junpenglao?

import numpy as np
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import pymc3 as pm
from pymc3.distributions.dist_math import normal_lccdf, normal_lcdf
import arviz as az
from copy import copy

def censor_y(x, y, bounds):
    # censor the y values by clipping them to the bounds
    cy = copy(y)
    cy[y <= bounds[0]] = bounds[0]
    cy[y >= bounds[1]] = bounds[1]
    return (x, cy)

# generate true unobserved data
m, c, σ, N = 1, 0, 2, 200
x = np.random.uniform(-10, 10, N)
y = np.random.normal(m * x + c, σ)
bounds = [-5, 5]

# generate observed, censored data
xc, yc = censor_y(x, y, bounds)

# Helper functions for unimputed censored model
def left_censored_likelihood(mu, sigma, n_left_censored, lower_bound):
    """Log likelihood contribution of the left-censored data."""
    return n_left_censored * normal_lcdf(mu, sigma, lower_bound)


def right_censored_likelihood(mu, sigma, n_right_censored, upper_bound):
    """Log likelihood contribution of the right-censored data."""
    return n_right_censored * normal_lccdf(mu, sigma, upper_bound)


def censored_regression(x, y, bounds):
    
    # data pre-processing
    n_left_censored = len(y[y <= bounds[0]])
    n_right_censored = len(y[y >= bounds[1]])
    # filter censored values out of y
    uncensored = (y>bounds[0]) & (y<bounds[1])
    x, y = x[uncensored], y[uncensored]

    with pm.Model() as model:
        m = pm.Normal("m", mu=0, sd=1)
        c = pm.Normal("c", mu=0, sd=1)
        σ = pm.HalfNormal("σ", sd=1)
        
        # uncensored data
        y_likelihood = pm.Normal("y_likelihood", mu=m * x + c, sd=σ, observed=y)

        # NOTE: the first argument (mu) is the bound itself, not m * x + c,
        # so these potentials never see the x values of the censored points;
        # this is the suspected problem discussed below
        left_censored = pm.Potential(
            "left_censored",
            left_censored_likelihood(bounds[0], σ, n_left_censored, bounds[0])
        )

        right_censored = pm.Potential(
            "right_censored",
            right_censored_likelihood(bounds[1], σ, n_right_censored, bounds[1])
        )
    
    
    with model:
        trace = pm.sample()

    return model, trace


# run the model on the censored data (xc, yc)
censored_model, censored_trace = censored_regression(xc, yc, bounds)

az.plot_posterior(censored_trace, var_names=['m'], ref_val=m)

The estimate of the slope is reasonable, but could be better. Not sure if this is just a hard problem, or if I’ve not quite got the model right? I have tried it with a few different random datasets and the same slight underestimate happens. The fit should be better than this with a proper censored regression model.


I’m pretty convinced the issue is with not implementing the pm.Potential terms correctly. For example, the model is unaware of the x values of the censored data.
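If that diagnosis is right, a hedged sketch of the fix (my assumption, not tested) would be to keep the x coordinates of the censored points and give each one its own mu = m * x + c inside the potentials, reusing the normal_lcdf / normal_lccdf imports above:

def censored_regression_v2(x, y, bounds):
    # keep the x coordinates of the censored points this time
    x_left, x_right = x[y <= bounds[0]], x[y >= bounds[1]]
    uncensored = (y > bounds[0]) & (y < bounds[1])
    xu, yu = x[uncensored], y[uncensored]

    with pm.Model() as model:
        m = pm.Normal("m", mu=0, sd=1)
        c = pm.Normal("c", mu=0, sd=1)
        σ = pm.HalfNormal("σ", sd=1)

        # uncensored data as before
        pm.Normal("y_likelihood", mu=m * xu + c, sd=σ, observed=yu)

        # one logcdf / logccdf term per censored point, each evaluated
        # at that point's own expected value m * x + c
        pm.Potential("left_censored",
                     normal_lcdf(m * x_left + c, σ, bounds[0]).sum())
        pm.Potential("right_censored",
                     normal_lccdf(m * x_right + c, σ, bounds[1]).sum())

        trace = pm.sample()

    return model, trace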