PPC with Anscombe dataset 3 using a Student T Model

Hello,
I’m following Osvaldo Martin’s ‘Bayesian Analysis with Python’ book and am having trouble with the section on robust linear regression.

The example uses the 3rd dataset from the Anscombe Quartet, with the goal of using a T distribution to make the model robust to the outlier.

I’ve generated the model with

import pymc3 as pm

# x_3, y_3 hold the third Anscombe dataset
with pm.Model() as anscombe3_model_t:
    α = pm.Normal('α', mu=y_3.mean(), sd=1)   # intercept
    β = pm.Normal('β', mu=0, sd=1)            # slope
    ϵ = pm.HalfCauchy('ϵ', 5)                 # scale of the likelihood
    ν_ = pm.Exponential('ν_', 1/29)
    ν = pm.Deterministic('ν', ν_ + 1)         # shift so ν > 1

    # Student's t likelihood makes the fit robust to the outlier
    y_pred = pm.StudentT('y_pred', mu=α + β * x_3, sd=ϵ, nu=ν, observed=y_3)

    trace_anscombe_t = pm.sample(2000, tune=1000)

Generating a fit line using the mean of the trace, I get what looks like the correct answer, i.e. a line which runs through all the points except the outlier.
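For reference, I'm drawing that line roughly like this (a minimal sketch from the posterior means; variable names match the model above):

import matplotlib.pyplot as plt

# posterior means of the intercept and slope
alpha_m = trace_anscombe_t['α'].mean()
beta_m = trace_anscombe_t['β'].mean()

plt.plot(x_3, y_3, 'C0.')                    # Anscombe III data points
plt.plot(x_3, alpha_m + beta_m * x_3, 'C1')  # fit line from posterior means
plt.show()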

From this (and from the fact that the posterior sds of α, β and ϵ are all reported as 0.0) I would expect the PPC to provide a pretty good estimate of y. But the samples generated by the following code show y predictions (and a mean of the y predictions) that are way off what I would expect, given how well the regression line fits, and also far from what Martin gets with the same code:

import arviz as az
import matplotlib.pyplot as plt

# draw posterior predictive samples and plot them against the observed data
ppc = pm.sample_posterior_predictive(trace_anscombe_t, samples=1000,
                                     model=anscombe3_model_t)
data_ppc = az.from_pymc3(trace=trace_anscombe_t, posterior_predictive=ppc)
ax = az.plot_ppc(data_ppc, figsize=(8, 5), mean=True)
plt.xlim(0, 14)
plt.show()

My results:
[image: my posterior predictive plot]

(I can only upload 1 image, but Martin’s results show a y_pred mean which is on the same scale as the observed results, whereas mine is way below it)

Am I doing something wrong with my PPC?
Thanks

@aloctavodia could you have a look?

Hi @RedPenguin101,

I am in the middle of nowhere without my notebook (and no copy of the book), but from what I remember and from the figure you posted, it seems the model is correctly predicting the mean values of y. Without "the outlier" the mean of the data is around 7. Maybe you are worried that the dashed mean line is not as tall as expected, but this is a consequence of the t distribution having very thick tails (for low values of nu).
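You can see the effect directly from the Student t pdf; here is a quick sketch with SciPy (the values of nu are just illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-6, 6, 500)
for nu in [1.5, 3, 30]:
    # lower nu -> thicker tails and a lower, flatter peak
    plt.plot(x, stats.t(df=nu).pdf(x), label=f'Student t, nu={nu}')
plt.plot(x, stats.norm.pdf(x), 'k--', label='Normal')
plt.legend()
plt.show()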

I will check back on this tomorrow when I am back home.

The man himself, very cool. Really enjoying the book.

Thanks for the response. The dashed line not being as tall as expected is exactly what is throwing me: in your version the posterior predictive mean and all the samples are right up around the observed values (for y around 5 to 7), but mine are much lower, which I think means they are less predictive of the data.

Another thing that puzzles me is how far the mean is from the samples, i.e. at every point on the curve it's below all the sample lines.

(I should mention as well that the original code had a random seed on the PPC sample. Mine did not, but I get the same result when I put it in.)

I suggest you play with SciPy's Student t distribution and try plotting samples for low and high values of nu, so you get familiar with this distribution, especially for low values of nu, and with how easy it is to get values far away from the mean. I also suspect the KDE estimation could be contributing to your confusion, as it may be overestimating the width of the distribution and thus underestimating its height. I will explore this issue and see if it can be improved. @RavinKumar mentioned something similar to me in the last few days.
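Something along these lines (a rough sketch; the nu values and sample size are arbitrary):

import numpy as np
import arviz as az
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
fig, ax = plt.subplots()
for nu in [1.5, 30]:
    # low nu easily produces samples far from the mean,
    # which widens and flattens the KDE
    samples = stats.t(df=nu).rvs(1000)
    az.plot_kde(samples, label=f'nu={nu}', ax=ax)
ax.legend()
plt.show()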