So it looks like using
TruncatedNormal for truncated data is inappropriate and that should be used for data which is censored (not truncated). And that for truncated data, bounded normals are appropriate.
Which makes my previous example of truncated regression https://github.com/drbenvincent/pymc3-demo-code/blob/master/truncated_regression_pymc3.ipynb wrong (as it used
I changed that to use Bounds…
BoundedNormal = pm.Bound(pm.Normal, lower=truncation_bounds, upper=truncation_bounds)
y_likelihood = BoundedNormal("y_likelihood", mu=m * x + c, sd=σ, observed=y)
but it looks like we can’t use bounds for observed data…
ValueError: Observed Bound distributions are not supported. If you want to model truncated data you can use a pm.Potential in combination with the cumulative probability function. See pymc3/examples/censored_data.py for an example.
This points you to an example on censored data, which seems inappropriate given the task is truncated data. And it seems frustrating that you can’t use a bounded/truncated normal with observed data - but I don’t know if there are technical reasons for that. Either way it would be good if you could.
Is there a concrete example/explanation of how you can model truncated (not censored) observed data? Any ideas very welcome.
I just noticed that the wikipedia article on the truncated normal distribution actually says this:
[The truncated normal distribution] is used to model … censored data in the Tobit model.
So, as you say, a truncated distribution is not appropriate for truncated data. So it seems that “truncated” is, rather unfortunately, a statistical homonym.
If I understand this correctly, in censored data, you know which boundary an extreme datum has crossed, and the typical way to codify this is to set said datum to said boundary. By contrast, in truncated data, you don’t know anything about extreme data, including how many there are or where they were in the data set. As it says here:
Truncation is similar to but distinct from the concept of statistical censoring. A truncated sample can be thought of as being equivalent to an underlying sample with all values outside the bounds entirely omitted, with not even a count of those omitted being kept. With statistical censoring, a note would be recorded documenting which bound (upper or lower) had been exceeded and the value of that bound. With truncated sampling, no note is recorded.
Logically, then, there is actually no proper solution to the problem of truncated data. The best you can do is Heckman correction, but I haven’t read up on this and don’t know if it makes sense in a Bayesian context.
Good educational info here. It clears up some of my confusion. Thanks
There are apparently some regression models for truncated data, but the Wikipedia page is vague. I’ve ordered a second hand copy of
Breen, Richard (1996). Regression Models : Censored, Samples Selected, or Truncated Data. Thousand Oaks
in the hope that it will be useful.
For my particular problem at hand, I’ve decided to go for maximum likelihood methods using:
- for truncated data… a
truncnormal distribution from
script.stats. It ‘works’ in terms of eliminating bias in regression slope estimates. But now it’s not clear that this is kosher at all. I was under the impression this was setting likelihood outside of the bounds to zero, but will
- for censored data… a custom likelihood. Although now it might be that
truncnormal distribution from
script.stats actually does this properly.
I may have been fooled into thinking
scipy.stats was doing what I wanted for truncated data because
import scipy.stats as stats
import numpy as np
bounds = [-0.5, 1]
mu = 0.5
sd = 0.5
a = (bounds - mu) / sd
b = (bounds - mu) / sd
x = np.linspace(-1, 3, 1000)
plt.plot(x, stats.truncnorm.pdf(x, a=a, b=b, loc=mu, scale=sd))
But I should really have checked the code to see what it does.
Yep, that is where I got Heckman correction from. No idea how to implement in PyMC3 though.
My journey of confusion continues… The wikipedia article does state that a truncated normal distribution is used when you have censored data. However the definition of the probability density seems to be zero outside the bounds, and a re-scaled normal within the bounds. The re-scaling is to account for the probability mass you chopped off outside the bounds. So a mathematical truncated normal is actually a bounded normal (in PyMC3 terminology)?
What seems intuitive (for censored data) is that you should assign the probability mass of the bounds to data points observed at the censor bounds as in…
So I’m confused with conflicting information.
This is really really not explained in an accessible way, at least to me as a non professional statistician. Maybe it will be clearer when this book turns up, but if anyone has any recommendations for clear and accessible info on censored (or truncated) regression then I’d be interested.
I did some simulations (just ML now, no PyMC3) and it seems that using a custom likelihood which does add the prob mass of truncated regions to the bounds does solve the bias (red circles). Regular regression with normal likelihood (black circles) underestimates the slope. Using
scipy.stats.truncnorm just gives you bias in the opposite direction (blue circles) !
I think I might take this to stats.stackexchange.com as this is no longer really about PyMC3 as such.
One day I’ll get my head around this.
pm.Bound does not do re-scaling, so a mathematical truncated normal is indeed
It is a bit weird that the
scipy.stats.truncnorm gives bias in opposite direction tho
TruncatedNormal results in many chain failures when ever I try to use it.