WAIC and In-Sample Deviance


#1

I have limited sample of age and height data (d) take from the full data set from McElreath’s book: Statistical Rethinking age_height.csv (1.9 KB)

I divided the data into training and testing sets:

train, test = train_test_split(d, test_size=0.5, random_state=13)

image
image

Then, I build 6 polynomial models with FLAT, UN-INFORMATIVE priors. Here’s an example of model number 3 with three parameters.

age_m3 = np.vstack((train.age_std,train.age_std**2,train.age_std**3))

with pm.Model() as m3:
    alpha = pm.Normal('alpha', mu=0, sd=200, testval=a_start)
    beta = pm.Normal('beta', mu=0, sd=50, shape=3)
    sigma = pm.HalfCauchy('sigma', beta=30, testval=sigma_start)
    mu = pm.Deterministic('mu', alpha + pm.math.dot(beta, age_m3))
    Height = pm.Normal('Height', mu=mu, sd=sigma, observed=train['height'])
    trace_m3 = pm.sample(1000, tune=1000, random_seed=13)

I also build 6 polynomial models with INFORMATIVE priors. Here’s an example of model number 3 again.

with pm.Model() as m3_tighter_prior:
    alpha = pm.Normal('alpha', mu=0, sd=100, testval=a_start)
    beta = pm.Normal('beta', mu=0, sd=10, shape=3)
    sigma = pm.HalfCauchy('sigma', beta=10, testval=sigma_start)
    mu = pm.Deterministic('mu',alpha + pm.math.dot(beta, age_m3))
    Height = pm.Normal('Height', mu=mu, sd=sigma, observed=train['height'])
    trace_m3_tighter_prior = pm.sample(1000, tune=1000, random_seed=13)

To finish, I calculate the in-sample deviance and the WAIC value (estimated out-of-sample deviance) for informative and un-informative priors.

For example, here is the in-sample deviance for model 3 of informative priors.

theta_3_tighter_prior = pm.summary(trace_m3_tighter_prior)['mean'][:5]

dev_3_in_sample_tighter_prior = - 2 * sum(stats.norm.logpdf(train['height'], 
                                    loc = theta_3_tighter_prior[0] + 
                                    theta_3_tighter_prior[1] * train['age_std'] +
                                    theta_3_tighter_prior[2] * train['age_std']**2 +
                                    theta_3_tighter_prior[3] * train['age_std']**3, 
                                    scale = theta_3_tighter_prior[4]))

WAIC value of model 3 of informative priors.

waic_tighter_3 = pm.waic(trace_m3_tighter_prior, m3_tighter_prior).WAIC

Plotting all deviances:

As expected, simulated in-sample deviances (red/green) are lower than WAIC values (estimated out-of-sample deviances, blue/gray), because in-sample deviances tend to overfit the data.

My question:
1. In theory, the WAIC values for informative-prior models are supposed to be lower than WAIC values for un-informative-prior models. Why is this the case only for models with 3 and 4 parameters?

2. The difference between green and red (in-sample deviances of informative and un-informative priors) is smallest at model 3. Does that mean anything?

Question 3:
Models with UN-informative priors

m1 m2 m3 m4 m5 m6
alpha 145.860602 156.110797 156.689380 156.944409 156.026089 156.222594
sigma 16.107881 12.437511 10.077042 10.316230 10.010530 9.983804

Models with informative priors

m1 m2 m3 m4 m5 m6
alpha 145.484445 153.437437 154.973432 153.968937 154.249675 154.171092
sigma 15.919727 12.373770 9.781053 9.873798 10.223518 10.208682

As more parameters are added, the model residual (sigma) of models with UN-informative priors decrease. However, with informative priors, the lowest model residuals belong to model 3 and 4.

Is a lower model residual a sign that the variables of age explain more of the variation in height? Is this a sign that model 3 or 4 are the most accurate at Retrodicting height?


#2

Dear Adam,

though I have an intuition about your questions, which probably matches yours, I lack deeper insight and leave the comments on your questions to others. (Specifically: I have only used LOO/WAIC to compare different model structures, and have so far never regarded them as relevant for prior structure, so thank you for sharing this.)

However, regarding the first part:

You omit uncertainty in your plot. WAIC in pymc returns waic_se, so you should look at those as well to evaluate if the difference in WAIC is relevant.

Cheers,

Falk


#3

Thank you, @falk for your help. I think I understand.

After doing the comparison between WAIC values, using pm.comapre, this is the table for the models of Regularized priors (blue):

WAIC pWAIC dWAIC weight SE dSE var_warn
model
model_3 143.57 3.29 0 0.45 4 0 1
model_4 144.6 3.53 1.03 0.27 4.58 1.11 1
model_5 145.99 3.75 2.42 0.14 4.75 1.43 1
model_6 146.46 3.97 2.89 0.11 4.93 1.61 1
model_2 153.14 3.48 9.57 0.03 4.71 4.37 1
model_1 161.79 2.62 18.22 0 6.69 6.8 1

And this is for UN-Informative priors (gray):

WAIC pWAIC dWAIC weight SE dSE var_warn
model
model_3 144.15 3.53 0 0.39 3.89 0 1
model_5 145.31 4.75 1.15 0.22 4.13 1.4 1
model_6 145.34 4.81 1.19 0.21 3.95 1.24 1
model_4 146.13 4.26 1.97 0.14 3.93 0.44 1
model_2 152.19 3.34 8.03 0.05 4.2 4.67 1
model_1 161.32 2.61 17.17 0 6.08 6.79 1

Then, I used their WAIC and SE value inside “np.random.normal()” and plotting both model 6’s.
image

Finally, I calculated the probability that model_6_INFORMATVE_prior’s WAIC value is actually lower than model_6_FLAT_prior’s WAIC value.

(model_6informative < model_6flat).mean()

Resulting in a 42.8% probability that it is lower/better.

That said, I’m not sure doing this is correct because deviance represents the relative distance of a model from perfect accuracy. By comparing the estimated deviance values between models with different priors, I’m afraid I’m treating them as absolute values or just not doing it right.

In fact, McElreath teaches us in chapter 6 that you should use dWAIC and dSE to compute probability of difference between models.

Any thoughts?


#4

Thank you, @adam,
I think this shows that you cannot effectively select from or distinguish between polynomial degrees {3,4,5,6}, indicated by dSE > dWAIC. All of them capture the problem; on the other hand, the higher degrees tend to overfit, which is penalized by the WAIC (and even the in-sample deviance). Looking at your data plots and drawing polynomials in there in my mind, this makes intuitive sense.

To share yet a bit of my intuition, the prior should matter only little. In the limiting case of fully data-dominated posterior (infinite number of observations), the model fit is prior-independent, and all conceptually identical models should fit the data equally well. The small difference you observe between the different prior sets are within the SE range. They are not identical because your observations are limited in number.

We should try to understand “why” :wink: dWAIC is just the comfortable display of the difference to the best fitting model, and I suspect dSE is an adjusted standard error that quantifies the uncertainty range of that difference. One could as well use the WAIC and SE: I’m working on a case where my default model is not the WAIC champion, and the better ones are biologically implausible or suffer sampling problems.

Hope this helps!

Falk


#5

thank you, @falk