PPC with Anscombe dataset 3 using a Student T Model

Hello,
I’m following Osvaldo Martin’s ‘Bayesian Analysis with Python’ book and am having trouble with the section on robust linear regression.

The example uses the 3rd dataset from the Anscombe Quartet, with the goal of using a T distribution to make the model robust to the outlier.

I’ve generated the model with

import pymc3 as pm

# x_3, y_3 hold the third Anscombe dataset
with pm.Model() as anscombe3_model_t:
    α = pm.Normal('α', mu=y_3.mean(), sd=1)   # intercept
    β = pm.Normal('β', mu=0, sd=1)            # slope
    ϵ = pm.HalfCauchy('ϵ', 5)                 # scale of the likelihood
    ν_ = pm.Exponential('ν_', 1/29)
    ν = pm.Deterministic('ν', ν_ + 1)         # shift so ν > 1

    # Student's t likelihood makes the fit robust to the outlier
    y_pred = pm.StudentT('y_pred', mu=α + β * x_3, sd=ϵ, nu=ν, observed=y_3)

    trace_anscombe_t = pm.sample(2000, tune=1000)

Generating a fit line using the mean of the trace, I get what looks like the correct answer, i.e. a line which runs through all the points except the outlier.
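For reference, I'm drawing that line roughly like this (a minimal sketch from the posterior means; variable names match the model above):

import matplotlib.pyplot as plt

# posterior means of the intercept and slope
alpha_m = trace_anscombe_t['α'].mean()
beta_m = trace_anscombe_t['β'].mean()

plt.plot(x_3, y_3, 'C0.')                    # Anscombe III data points
plt.plot(x_3, alpha_m + beta_m * x_3, 'C1')  # fit line from posterior means
plt.show()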

From this (and from the fact that the posterior sds of α, β and ϵ are all reported as 0.0) I would expect the PPC to provide a pretty good estimate of y. But the samples generated by the following code show y predictions (and a mean of the y predictions) that are way off what I would expect, given how well the regression line fits, and also far from what Martin gets with the same code:

import arviz as az
import matplotlib.pyplot as plt

# draw posterior predictive samples and plot them against the observed data
ppc = pm.sample_posterior_predictive(trace_anscombe_t, samples=1000,
                                     model=anscombe3_model_t)
data_ppc = az.from_pymc3(trace=trace_anscombe_t, posterior_predictive=ppc)
ax = az.plot_ppc(data_ppc, figsize=(8, 5), mean=True)
plt.xlim(0, 14)
plt.show()

My results:
[image: my posterior predictive plot]

(I can only upload 1 image, but Martin’s results show a y_pred mean which is on the same scale as the observed results, whereas mine is way below it)

Am I doing something wrong with my PPC?
Thanks

@aloctavodia could you have a look?

Hi @RedPenguin101,

I am in the middle of nowhere without my notebook (and no copy of the book), but from what I remember and from the figure you posted, it seems the model is correctly predicting the mean values of y. Without "the outlier" the mean of the data is around 7. Maybe you are worried that the dashed mean line is not as tall as expected, but this is a consequence of the t distribution having very thick tails (for low values of nu).
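You can see the effect directly from the Student t pdf; here is a quick sketch with SciPy (the values of nu are just illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-6, 6, 500)
for nu in [1.5, 3, 30]:
    # lower nu -> thicker tails and a lower, flatter peak
    plt.plot(x, stats.t(df=nu).pdf(x), label=f'Student t, nu={nu}')
plt.plot(x, stats.norm.pdf(x), 'k--', label='Normal')
plt.legend()
plt.show()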

I will check back on this tomorrow when I am back home.

The man himself, very cool. Really enjoying the book.

Thanks for the response. The dashed line not being as tall as expected is exactly what is throwing me: in your version the posterior predictive mean and all the samples are right up around the observed values (for y around 5 to 7), but mine are much lower, which I think means they are less predictive of the data.

Another thing that puzzles me is how far the mean is from the samples, i.e. at every point on the curve it's below all the sample lines.

(I should mention as well that the original code had a random seed on the PPC sample. Mine did not, but I get the same result when I put it in.)

I suggest you play with SciPy's Student t distribution and try plotting samples for low and high values of nu, so you get familiar with this distribution, especially for low values of nu, and with how easy it is to get values far away from the mean. I also suspect the KDE estimation could be contributing to your confusion, as it may be overestimating the width of the distribution and thus underestimating its height. I will explore this issue and see if it can be improved. @RavinKumar mentioned something similar to me in the last few days.
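Something along these lines (a rough sketch; the nu values and sample size are arbitrary):

import numpy as np
import arviz as az
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
fig, ax = plt.subplots()
for nu in [1.5, 30]:
    # low nu easily produces samples far from the mean,
    # which widens and flattens the KDE
    samples = stats.t(df=nu).rvs(1000)
    az.plot_kde(samples, label=f'nu={nu}', ax=ax)
ax.legend()
plt.show()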