WAIC and In-Sample Deviance

adam · October 3, 2018, 12:20am

I have limited sample of age and height data (d) take from the full data set from McElreath’s book: Statistical Rethinking age_height.csv (1.9 KB)

I divided the data into training and testing sets:

train, test = train_test_split(d, test_size=0.5, random_state=13)

Then, I build 6 polynomial models with FLAT, UN-INFORMATIVE priors. Here’s an example of model number 3 with three parameters.

age_m3 = np.vstack((train.age_std,train.age_std**2,train.age_std**3))

with pm.Model() as m3:
    alpha = pm.Normal('alpha', mu=0, sd=200, testval=a_start)
    beta = pm.Normal('beta', mu=0, sd=50, shape=3)
    sigma = pm.HalfCauchy('sigma', beta=30, testval=sigma_start)
    mu = pm.Deterministic('mu', alpha + pm.math.dot(beta, age_m3))
    Height = pm.Normal('Height', mu=mu, sd=sigma, observed=train['height'])
    trace_m3 = pm.sample(1000, tune=1000, random_seed=13)

I also build 6 polynomial models with INFORMATIVE priors. Here’s an example of model number 3 again.

with pm.Model() as m3_tighter_prior:
    alpha = pm.Normal('alpha', mu=0, sd=100, testval=a_start)
    beta = pm.Normal('beta', mu=0, sd=10, shape=3)
    sigma = pm.HalfCauchy('sigma', beta=10, testval=sigma_start)
    mu = pm.Deterministic('mu',alpha + pm.math.dot(beta, age_m3))
    Height = pm.Normal('Height', mu=mu, sd=sigma, observed=train['height'])
    trace_m3_tighter_prior = pm.sample(1000, tune=1000, random_seed=13)

To finish, I calculate the in-sample deviance and the WAIC value (estimated out-of-sample deviance) for informative and un-informative priors.

For example, here is the in-sample deviance for model 3 of informative priors.

theta_3_tighter_prior = pm.summary(trace_m3_tighter_prior)['mean'][:5]

dev_3_in_sample_tighter_prior = - 2 * sum(stats.norm.logpdf(train['height'], 
                                    loc = theta_3_tighter_prior[0] + 
                                    theta_3_tighter_prior[1] * train['age_std'] +
                                    theta_3_tighter_prior[2] * train['age_std']**2 +
                                    theta_3_tighter_prior[3] * train['age_std']**3, 
                                    scale = theta_3_tighter_prior[4]))

WAIC value of model 3 of informative priors.

waic_tighter_3 = pm.waic(trace_m3_tighter_prior, m3_tighter_prior).WAIC

Plotting all deviances:

As expected, simulated in-sample deviances (red/green) are lower than WAIC values (estimated out-of-sample deviances, blue/gray), because in-sample deviances tend to overfit the data.

My question:
1. In theory, the WAIC values for informative-prior models are supposed to be lower than WAIC values for un-informative-prior models. Why is this the case only for models with 3 and 4 parameters?

2. The difference between green and red (in-sample deviances of informative and un-informative priors) is smallest at model 3. Does that mean anything?

Question 3:
Models with UN-informative priors

	m1	m2	m3	m4	m5	m6
alpha	145.860602	156.110797	156.689380	156.944409	156.026089	156.222594
sigma	16.107881	12.437511	10.077042	10.316230	10.010530	9.983804

Models with informative priors

	m1	m2	m3	m4	m5	m6
alpha	145.484445	153.437437	154.973432	153.968937	154.249675	154.171092
sigma	15.919727	12.373770	9.781053	9.873798	10.223518	10.208682

As more parameters are added, the model residual (sigma) of models with UN-informative priors decrease. However, with informative priors, the lowest model residuals belong to model 3 and 4.

Is a lower model residual a sign that the variables of age explain more of the variation in height? Is this a sign that model 3 or 4 are the most accurate at Retrodicting height?

falk · October 3, 2018, 7:00am

Dear Adam,

though I have an intuition about your questions, which probably matches yours, I lack deeper insight and leave the comments on your questions to others. (Specifically: I have only used LOO/WAIC to compare different model structures, and have so far never regarded them as relevant for prior structure, so thank you for sharing this.)

However, regarding the first part:

You omit uncertainty in your plot. WAIC in pymc returns waic_se, so you should look at those as well to evaluate if the difference in WAIC is relevant.

Cheers,

Falk

adam · October 3, 2018, 2:35pm

Thank you, @falk for your help. I think I understand.

After doing the comparison between WAIC values, using pm.comapre, this is the table for the models of Regularized priors (blue):

	WAIC	pWAIC	dWAIC	weight	SE	dSE	var_warn
model
model_3	143.57	3.29	0	0.45	4	0	1
model_4	144.6	3.53	1.03	0.27	4.58	1.11	1
model_5	145.99	3.75	2.42	0.14	4.75	1.43	1
model_6	146.46	3.97	2.89	0.11	4.93	1.61	1
model_2	153.14	3.48	9.57	0.03	4.71	4.37	1
model_1	161.79	2.62	18.22	0	6.69	6.8	1

And this is for UN-Informative priors (gray):

	WAIC	pWAIC	dWAIC	weight	SE	dSE	var_warn
model
model_3	144.15	3.53	0	0.39	3.89	0	1
model_5	145.31	4.75	1.15	0.22	4.13	1.4	1
model_6	145.34	4.81	1.19	0.21	3.95	1.24	1
model_4	146.13	4.26	1.97	0.14	3.93	0.44	1
model_2	152.19	3.34	8.03	0.05	4.2	4.67	1
model_1	161.32	2.61	17.17	0	6.08	6.79	1

Then, I used their WAIC and SE value inside “np.random.normal()” and plotting both model 6’s.

Finally, I calculated the probability that model_6_INFORMATVE_prior’s WAIC value is actually lower than model_6_FLAT_prior’s WAIC value.

(model_6informative < model_6flat).mean()

Resulting in a 42.8% probability that it is lower/better.

That said, I’m not sure doing this is correct because deviance represents the relative distance of a model from perfect accuracy. By comparing the estimated deviance values between models with different priors, I’m afraid I’m treating them as absolute values or just not doing it right.

In fact, McElreath teaches us in chapter 6 that you should use dWAIC and dSE to compute probability of difference between models.

Any thoughts?

falk · October 3, 2018, 6:04pm

Thank you, @adam,
I think this shows that you cannot effectively select from or distinguish between polynomial degrees {3,4,5,6}, indicated by dSE > dWAIC. All of them capture the problem; on the other hand, the higher degrees tend to overfit, which is penalized by the WAIC (and even the in-sample deviance). Looking at your data plots and drawing polynomials in there in my mind, this makes intuitive sense.

To share yet a bit of my intuition, the prior should matter only little. In the limiting case of fully data-dominated posterior (infinite number of observations), the model fit is prior-independent, and all conceptually identical models should fit the data equally well. The small difference you observe between the different prior sets are within the SE range. They are not identical because your observations are limited in number.

We should try to understand “why” dWAIC is just the comfortable display of the difference to the best fitting model, and I suspect dSE is an adjusted standard error that quantifies the uncertainty range of that difference. One could as well use the WAIC and SE: I’m working on a case where my default model is not the WAIC champion, and the better ones are biologically implausible or suffer sampling problems.

Hope this helps!

Falk

adam · October 3, 2018, 9:35pm

thank you, @falk

Topic		Replies	Views
Computing out-of-sample deviance Questions	5	2390	January 30, 2018
Statistical Rethinking 2nd Ed, Ch4, Exercise 4H1 version agnostic	3	304	September 15, 2023
Out of sample/model predictions in hierarchical model with new observed data v5 modeling	9	211	October 16, 2024
Theoretical and Practical Considerations and Questions v5 development , modeling , sampling	0	23	September 13, 2024
WAIC values differing between identical fits? Questions	0	510	July 22, 2019

WAIC and In-Sample Deviance

Related topics