Dear all,
I am very new to the Bayesian approach and PyMC3, and I am struggling with how to apply the censored-data model (https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/censored_data.py) to an x-y data set.
Let’s start from a simple case:
Suppose we measure two linearly related physical quantities x and y (y = 1 + 0.7x), but y is only measurable within an x-dependent interval: 1.5 + 0.5x < y < 20 + 0.5x. So the data consist of (1) samples with known x and y, and (2) samples with known x and only an upper or lower limit on y.
I plan to fit the observed x-y data with a linear relation. If the censoring is handled properly, I expect to recover the original slope of ~0.7; otherwise the fit will be pulled toward the spurious slope of 0.5 imposed by the boundaries.
Here is my code for the model fitting, basically the same as the unimputed censored model in that example.
```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

def model(A, x):
    return A[0] + x * A[1]

# Generate test data, y = 1 + 0.7x
n = 1000
test_x = np.random.uniform(low=1, high=100, size=n)
test_y = model([1, 0.7], test_x) + np.random.normal(0, 10, n)

# Censor the data: y is only observed between 1.5 + 0.5x and 20 + 0.5x
left_x = test_x[test_y < model([1.5, 0.5], test_x)]    # left-censored: only an upper limit on y
right_x = test_x[test_y > model([20., 0.5], test_x)]   # right-censored: only a lower limit on y
detected = (test_y > model([1.5, 0.5], test_x)) & (test_y < model([20., 0.5], test_x))
detect_x = test_x[detected]
detect_y = test_y[detected]

# Count the detections, lower limits, and upper limits
n_right = len(right_x)
n_left = len(left_x)
n_detect = len(detect_x)

# ---------------------------------------------------------------------------
# Numerically stable log-CDF / log-CCDF of the normal, for the censored terms
# ---------------------------------------------------------------------------
def normal_lcdf(mu, sigma, x):
    z = (x - mu) / sigma
    return tt.switch(
        tt.lt(z, -1.0),
        tt.log(tt.erfcx(-z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(z / tt.sqrt(2.)) / 2.)
    )

def normal_lccdf(mu, sigma, x):
    z = (x - mu) / sigma
    return tt.switch(
        tt.gt(z, 1.0),
        tt.log(tt.erfcx(z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(-z / tt.sqrt(2.)) / 2.)
    )

def censored_right_likelihood(mu, sigma, n_right_censored, upper_bound):
    # Right-censored points lie above the upper bound
    return n_right_censored * normal_lccdf(mu, sigma, upper_bound)

def censored_left_likelihood(mu, sigma, n_left_censored, lower_bound):
    # Left-censored points lie below the lower bound
    return n_left_censored * normal_lcdf(mu, sigma, lower_bound)

# Model: fit x-y with y = beta + alpha * x
basic_model = pm.Model()
with basic_model:
    # Priors for the unknown model parameters
    alpha = pm.Uniform(r'$\alpha$', lower=0.1, upper=1)
    beta = pm.Uniform(r'$\beta$', lower=0.1, upper=10)
    sd = pm.Uniform(r'$\sigma$', lower=1, upper=20)
    # In real data, the boundary intercepts could be slightly uncertain
    upper_beta = pm.Uniform('upper_beta', lower=0.1, upper=30)
    lower_beta = pm.Uniform('lower_beta', lower=0.1, upper=30)

    # Expected value of the outcome
    model_y = model([beta, alpha], detect_x)
    left_y = model([beta, alpha], left_x)
    right_y = model([beta, alpha], right_x)
    upper_bound = upper_beta + detect_x * 0.5
    lower_bound = lower_beta + detect_x * 0.5

    # Likelihood of the observations
    likelihood_y = pm.Normal('y', mu=model_y, sd=sd, observed=detect_y, shape=n_detect)
    right_censored = pm.Potential('right_censored',
                                  censored_right_likelihood(model_y, sd, n_right, upper_bound))
    left_censored = pm.Potential('left_censored',
                                 censored_left_likelihood(model_y, sd, n_left, lower_bound))

    step = pm.Metropolis()
    # Draw 10000 posterior samples
    trace = pm.sample(10000, tune=1000, step=step, init='ADVI')
```
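(As a side note, I don't think the problem is numerical: I sanity-checked the two-branch stable log-CDF formula with a plain NumPy version against `scipy.stats.norm.logcdf`. The `np_normal_lcdf` name below is just my own helper for this check.)

```python
import numpy as np
from scipy.special import erfc, erfcx
from scipy.stats import norm

def np_normal_lcdf(mu, sigma, x):
    # Same two-branch stable formula as the theano version above:
    # for z < -1 use log(erfcx(-z/sqrt(2))/2) - z^2/2, else log1p(-erfc(z/sqrt(2))/2)
    z = (x - mu) / sigma
    with np.errstate(all='ignore'):  # np.where evaluates both branches
        return np.where(
            z < -1.0,
            np.log(erfcx(-z / np.sqrt(2.)) / 2.) - z**2 / 2.,
            np.log1p(-erfc(z / np.sqrt(2.)) / 2.),
        )

zs = np.linspace(-30., 8., 77)
assert np.allclose(np_normal_lcdf(0., 1., zs), norm.logcdf(zs))
```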
Although the code runs smoothly, this model contains no information about how the number of left/right-censored points varies with x. It is not surprising that the obtained slope is about 0.5 instead of the correct 0.7.
I am wondering what the proper way is to handle this kind of censored data with variable (x-dependent) boundaries.
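For what it's worth, I also tried a plain maximum-likelihood version outside PyMC3 (NumPy/SciPy, assuming the boundary intercepts 1.5 and 20 are known exactly, with a fixed seed and Nelder-Mead just as convenient choices). There, every censored sample contributes its own log-CDF/log-survival term evaluated at that sample's x-dependent bound, and this does recover the ~0.7 slope, while a naive fit to the detections alone is biased toward 0.5. So I suspect the potentials need to be per-point rather than count-times-one-term, but I am not sure how to express that properly in PyMC3:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(1., 100., n)
y = 1. + 0.7 * x + rng.normal(0., 10., n)

lower = 1.5 + 0.5 * x   # below this, only an upper limit is recorded
upper = 20. + 0.5 * x   # above this, only a lower limit is recorded
left = y < lower
right = y > upper
seen = ~(left | right)

# Naive fit to the detections only, biased toward the boundary slope 0.5
naive = np.polyfit(x[seen], y[seen], 1)[0]

def neg_loglike(params):
    beta, alpha, log_sd = params
    sd = np.exp(log_sd)                # keep sigma positive
    mu = beta + alpha * x
    ll = norm.logpdf(y[seen], mu[seen], sd).sum()
    ll += norm.logcdf(lower[left], mu[left], sd).sum()   # P(Y_i < lower_i), per point
    ll += norm.logsf(upper[right], mu[right], sd).sum()  # P(Y_i > upper_i), per point
    return -ll

res = minimize(neg_loglike, x0=[0., 0.5, np.log(5.)], method='Nelder-Mead',
               options={'maxiter': 5000, 'xatol': 1e-6, 'fatol': 1e-6})
beta_hat, alpha_hat = res.x[0], res.x[1]
sd_hat = np.exp(res.x[2])
print(naive, beta_hat, alpha_hat, sd_hat)  # slope should come out near 0.7, not 0.5
```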
I’d really appreciate any help and advice.
Thanks