Hi all,
I’ve been reading a little about using EM to fit regression models with some censored data. I wondered about using pymc3 to fit the same data to compare the results between the two methods but am having some difficulties.
The data and some visualisations of the data can be found here:
https://jackmedley.gitlab.io/blog/post/em_algorithm_2/
In summary; Some linear data is generated but then censored above some threshold line. The threshold line has non-zero gradient so the censored values for the data point is different.
I have tried implementing this in pymc3 as follows:
import pymc3 as pm
with pm.Model() as model:
b0 = pm.Normal('b0', 0.0, 100.0)
b1 = pm.Normal('b1', 0.0, 100.0)
sigma = pm.HalfCauchy('sigma', 100.0)
zz = [
pm.TruncatedNormal(f'Z_{itr}', b0 + b1 * x, sigma, lower=y)
for itr, (x, y) in enumerate(zip(xx[censored], yy[censored]))
]
pm.Normal('y_observed', b0 + b1 * xx[~censored], sigma, observed=yy[~censored])
trace = pm.sample(1000, tune=1000, cores=16)
As you can I’ve drawn each z
(which in the notation of the link above is the latent variable we don’t get to observe) from a truncated normal which ‘starts’ at the censored value, and I’ve drawn the observed y
values (not censored) from a Normal distribution.
When I sample this it takes quite a lot longer than an ordinary linear regression, presumably because I’ve added so many extra z
variables, but the result is the same as I get when I run a regular OLS:
with pm.Model() as model:
b0, b1 = pm.Normal('b0', 0.0, 100.0), pm.Normal('b1', 0.0, 100.0)
sigma = pm.HalfCauchy('sigma', 100.0)
pm.Normal('Y', b0 + b1 * xx[~censored], sigma, observed=yy[~censored])
trace = pm.sample(1000, tune=1000, cores=16)
Any tips for where I’m going wrong would be much appreciated. And if anyone knows a way of avoiding my zz
hack that would be great too!
Thanks!
Jack