Modeling censored x-y data set with x-dependent boundary

Dear all,

I am very new to Bayesian approach and pymc3.
I am struggling with how to apply the censored model (https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/censored_data.py) to x-y data set.

Let’s start from a simple case:
Suppose we are measuring two linearly related physical quantity x and y (y = 1 + 0.7x),
but y is only measurable between certain intervals, which is x-dependent (y < 20 + 0.5x and y > 1.5 + 0.5x). So the data is consist of (1) samples with known x and y, (2) samples with known x and upper/lower limits of y.

I plan to fit the observed x-y with a linear relation. If the data censoring is well handled, ideally I will expect to get a original slope of ~0.7, otherwise we will get a fake slope of 0.5.
Here is my code to do the model fitting, basically the same as the unimputed censored model in the example.

import numpy as np
import pymc3 as pm
import theano.tensor as tt

def model(A,x):
	return A[0]+x*A[1]

# Generate Testing data, y= 1+ 0.7x
n = 1000
test_x = np.random.uniform(low=1,high=100,size=n)
test_y = model([1,0.7],test_x)+np.random.normal(0, 10, n)
 
# Censored data
left_x = test_x [test_x < model([1.5,0.5], test_x)]
right_x = test_x [test_x > model([20.,0.5], test_x)]
detect_x = test_x [(test_y > model([1.5,0.5], test_x)) & (test_y < model([20.,0.5], test_x))]
detect_y = test_y [(test_y > model([1.5,0.5], test_x)) & (test_y < model([20.,0.5], test_x))]

# Count the number of detection, lower limits, and upper limits
n_right=len(right_x)
n_left=len(left_x)
n_detect=len(detect_x)

#----------------------------------------------------------------------------------
# Define the Likelihood function for censored model
#----------------------------------------------------------------------------------
def normal_lcdf(mu, sigma, x):
    z = (x - mu) / sigma
    return tt.switch(
        tt.lt(z, -1.0),
        tt.log(tt.erfcx(-z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(z / tt.sqrt(2.)) / 2.)
    )

def normal_lccdf(mu, sigma, x):
    z = (x - mu) / sigma
    return tt.switch(
        tt.gt(z, 1.0),
        tt.log(tt.erfcx(z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(-z / tt.sqrt(2.)) / 2.)
    )

def censored_right_likelihood(mu, sigma, n_right_censored, upper_bound):
    return n_right_censored * normal_lccdf(mu, sigma, lower_bound)

def censored_left_likelihood(mu, sigma, n_left_censored, lower_bound):
    return n_left_censored * normal_lcdf(mu, sigma, upper_bound)

# Model, fit x-y with y = beta + alpha*x
basic_model = pm.Model()
with basic_model:
	# Priors for unknown model parameters
	alpha = pm.Uniform(r'$\alpha$', lower=0.1, upper=1)
	beta = pm.Uniform(r'$\beta$', lower=0.1, upper=10)
	sd = pm.Uniform(r'$\sigma$', lower=1, upper=20)
        # In real data, the boundary could be slightly uncertain 
	upper_beta = pm.Uniform('upper_beta', lower=0.1, upper=30)
	lower_beta = pm.Uniform('lower_beta', lower=0.1, upper=30)

	# Expected value of outcome
	model_y = model([beta,alpha], detect_x)
	left_y = model([beta,alpha], left_x)
	right_y = model([beta,alpha], right_x)
	upper_bound= upper_beta + (detect_x*0.5)
	lower_bound= lower_beta + (detect_x*0.5)

	# Likelihood of observations
	likelihood_y = pm.Normal('y', mu=model_y, sd=sd, observed=detect_y,shape=n_detect)
	right_censored = pm.Potential('right_censored',censored_right_likelihood(model_y, sd, n_right, lower_bound))
left_censored = pm.Potential('left_censored',censored_left_likelihood(model_y, sd, n_left, upper_bound))

	step = pm.Metropolis()
    # draw 10000 posterior samples
	trace = pm.sample(10000,tune=1000, step=step,init='ADVI')

Although the code can run smoothly, but this model contains no information of how the number of left/right censored data varies with x. It is not surprising that the obtained slope is about 0.5, instead of the correct slope 0.7.

I am wondering what is the proper way to handle these kinds of censored data with variable boundaries?
I’d really appreciate any help and advice.
Thanks

Hi!

If I understand you correctly, what makes this tricky is that each censored data point has been censored by different thresholds, right?

Then probably the most naive solution would be to compute normal_lcdf (or normal_lccdf) for every censored data point, and pass their sum into a pm.Potential. In other words, if the data point at x has been censored, pass 1.5 + 0.5x into normal_lcdf (or 20 + 0.5x into normal_lccdf, depending on whether it’s censored from above or below). Do this for every censored data point, add them all up, and pass that into your pm.Potential.

Let me know if that helps/if it works!

Btw, the python script you referenced has been ported to a notebook and its now here

1 Like

Hi,

Thanks for your suggestion.
I tried to pass the x-dependent thresholds to pm.Potential for each censored data via

    left_y = model([beta,alpha], left_x)
    right_y = model([beta,alpha], right_x)
    lower_bound= lower_beta + (left_x*0.5)
    upper_bound= upper_beta + (right_x*0.5)
    right_censored = pm.Potential('right_censored',censored_right_likelihood(right_y, sd, n_right, upper_bound))
    left_censored = pm.Potential('left_censored',censored_left_likelihood(left_y, sd, n_left, lower_bound))

The model seems to work. Here shows the posterior prediction

However the results vary a lot with different prior.
If I used a uniform priors with large possible range, the derived slope could change from 0.5 to 0.8.
If I used priors with much narrower range, the results could be very close to the input slope.
I guess this is just caused by the model complexity.

Please let me know if I do something wrong, or is there anyway to improve the model.
Thanks

Hi!

Glad to hear that it looks like it’s working!

About your choice of priors: it’s generally a bad idea to put a uniform prior over a large interval. A better choice would be a normal distribution. That way, you don’t assign zero probability to any value, but express your belief that a coefficient of e.g. 5 is more reasonable than 500. For parameters that must be non-negative (e.g. standard deviations) you can use a half-normal, or an exponential prior.

Hope this helps!