Modeling censored x-y data set with x-dependent boundary

jwwang · July 17, 2018, 11:19am

Dear all,

I am very new to Bayesian approach and pymc3.
I am struggling with how to apply the censored model (https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/censored_data.py) to x-y data set.

Let’s start from a simple case:
Suppose we are measuring two linearly related physical quantity x and y (y = 1 + 0.7x),
but y is only measurable between certain intervals, which is x-dependent (y < 20 + 0.5x and y > 1.5 + 0.5x). So the data is consist of (1) samples with known x and y, (2) samples with known x and upper/lower limits of y.

I plan to fit the observed x-y with a linear relation. If the data censoring is well handled, ideally I will expect to get a original slope of ~0.7, otherwise we will get a fake slope of 0.5.
Here is my code to do the model fitting, basically the same as the unimputed censored model in the example.

import numpy as np
import pymc3 as pm
import theano.tensor as tt

def model(A,x):
	return A[0]+x*A[1]

# Generate Testing data, y= 1+ 0.7x
n = 1000
test_x = np.random.uniform(low=1,high=100,size=n)
test_y = model([1,0.7],test_x)+np.random.normal(0, 10, n)
 
# Censored data
left_x = test_x [test_x < model([1.5,0.5], test_x)]
right_x = test_x [test_x > model([20.,0.5], test_x)]
detect_x = test_x [(test_y > model([1.5,0.5], test_x)) & (test_y < model([20.,0.5], test_x))]
detect_y = test_y [(test_y > model([1.5,0.5], test_x)) & (test_y < model([20.,0.5], test_x))]

# Count the number of detection, lower limits, and upper limits
n_right=len(right_x)
n_left=len(left_x)
n_detect=len(detect_x)

#----------------------------------------------------------------------------------
# Define the Likelihood function for censored model
#----------------------------------------------------------------------------------
def normal_lcdf(mu, sigma, x):
    z = (x - mu) / sigma
    return tt.switch(
        tt.lt(z, -1.0),
        tt.log(tt.erfcx(-z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(z / tt.sqrt(2.)) / 2.)
    )

def normal_lccdf(mu, sigma, x):
    z = (x - mu) / sigma
    return tt.switch(
        tt.gt(z, 1.0),
        tt.log(tt.erfcx(z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(-z / tt.sqrt(2.)) / 2.)
    )

def censored_right_likelihood(mu, sigma, n_right_censored, upper_bound):
    return n_right_censored * normal_lccdf(mu, sigma, lower_bound)

def censored_left_likelihood(mu, sigma, n_left_censored, lower_bound):
    return n_left_censored * normal_lcdf(mu, sigma, upper_bound)

# Model, fit x-y with y = beta + alpha*x
basic_model = pm.Model()
with basic_model:
	# Priors for unknown model parameters
	alpha = pm.Uniform(r'$\alpha$', lower=0.1, upper=1)
	beta = pm.Uniform(r'$\beta$', lower=0.1, upper=10)
	sd = pm.Uniform(r'$\sigma$', lower=1, upper=20)
        # In real data, the boundary could be slightly uncertain 
	upper_beta = pm.Uniform('upper_beta', lower=0.1, upper=30)
	lower_beta = pm.Uniform('lower_beta', lower=0.1, upper=30)

	# Expected value of outcome
	model_y = model([beta,alpha], detect_x)
	left_y = model([beta,alpha], left_x)
	right_y = model([beta,alpha], right_x)
	upper_bound= upper_beta + (detect_x*0.5)
	lower_bound= lower_beta + (detect_x*0.5)

	# Likelihood of observations
	likelihood_y = pm.Normal('y', mu=model_y, sd=sd, observed=detect_y,shape=n_detect)
	right_censored = pm.Potential('right_censored',censored_right_likelihood(model_y, sd, n_right, lower_bound))
left_censored = pm.Potential('left_censored',censored_left_likelihood(model_y, sd, n_left, upper_bound))

	step = pm.Metropolis()
    # draw 10000 posterior samples
	trace = pm.sample(10000,tune=1000, step=step,init='ADVI')

Although the code can run smoothly, but this model contains no information of how the number of left/right censored data varies with x. It is not surprising that the obtained slope is about 0.5, instead of the correct slope 0.7.

I am wondering what is the proper way to handle these kinds of censored data with variable boundaries?
I’d really appreciate any help and advice.
Thanks

_eigenfoo · July 19, 2018, 12:36pm

Hi!

If I understand you correctly, what makes this tricky is that each censored data point has been censored by different thresholds, right?

Then probably the most naive solution would be to compute normal_lcdf (or normal_lccdf) for every censored data point, and pass their sum into a pm.Potential. In other words, if the data point at x has been censored, pass 1.5 + 0.5x into normal_lcdf (or 20 + 0.5x into normal_lccdf, depending on whether it’s censored from above or below). Do this for every censored data point, add them all up, and pass that into your pm.Potential.

Let me know if that helps/if it works!

_eigenfoo · July 19, 2018, 12:37pm

Btw, the python script you referenced has been ported to a notebook and its now here

jwwang · July 23, 2018, 8:39am

Hi,

Thanks for your suggestion.
I tried to pass the x-dependent thresholds to pm.Potential for each censored data via

    left_y = model([beta,alpha], left_x)
    right_y = model([beta,alpha], right_x)
    lower_bound= lower_beta + (left_x*0.5)
    upper_bound= upper_beta + (right_x*0.5)
    right_censored = pm.Potential('right_censored',censored_right_likelihood(right_y, sd, n_right, upper_bound))
    left_censored = pm.Potential('left_censored',censored_left_likelihood(left_y, sd, n_left, lower_bound))

The model seems to work. Here shows the posterior prediction

However the results vary a lot with different prior.
If I used a uniform priors with large possible range, the derived slope could change from 0.5 to 0.8.
If I used priors with much narrower range, the results could be very close to the input slope.
I guess this is just caused by the model complexity.

Please let me know if I do something wrong, or is there anyway to improve the model.
Thanks

_eigenfoo · July 23, 2018, 11:59am

Hi!

Glad to hear that it looks like it’s working!

About your choice of priors: it’s generally a bad idea to put a uniform prior over a large interval. A better choice would be a normal distribution. That way, you don’t assign zero probability to any value, but express your belief that a coefficient of e.g. 5 is more reasonable than 500. For parameters that must be non-negative (e.g. standard deviations) you can use a half-normal, or an exponential prior.

Hope this helps!

Topic		Replies	Views
Censored linear regression with non-constant censoring threshold Questions	2	748	May 26, 2020
Advice on Modeling Censored Data from a Ratio of Variables version agnostic	1	21	March 13, 2025
Modelling censored data in pymc v4: a simple Gaussian random effects example v5 development , modeling	0	387	August 30, 2022
Regression with censored response variable Questions	0	577	October 24, 2019
Translating Censored Data example from PyMC2 to PyMC3 Questions	2	705	July 27, 2017

Modeling censored x-y data set with x-dependent boundary

Related topics