Bayesian model averaging: rankings by model weight and by LOO don't match

I recently compared some models using pm.compare. When I use “stacking”, the rankings by LOO don’t match those by weight:

index         rank  loo       p_loo    d_loo    weight       se       dse      warning  loo_scale
Full          0     -1032.49  58.5355  0        0.785312     19.6743  0        1        log
Decay only    1     -1039.55  53.1845  7.06546  0.127025     21.1935  6.6104   1        log
Death only    2     -1077.52  62.0865  45.0306  1.06858e-05  21.1415  11.0959  1        log
Null          3     -1136.47  47.9011  103.986  0.087652     26.6238  18.8582  0        log
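
For reference, the table came from a call along these lines (the trace names here are placeholders for my actual fitted models):

```python
import arviz as az

# Placeholder traces standing in for my fitted models
model_dict = {
    "Full": trace_full,
    "Decay only": trace_decay,
    "Death only": trace_death,
    "Null": trace_null,
}

# LOO on the log scale, with stacking weights
comparison = az.compare(model_dict, ic="loo", method="stacking", scale="log")
print(comparison)
```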

Has anyone seen this happen before? Perhaps it has something to do with the warnings?


Hi Sam!
Which version of ArviZ are you using?
In the most recent release (0.7.0), the default scale is now log (as in your table), which means that the higher the LOO, the better the model. So, in that case, I think the LOO rankings should match the rankings by weight.

I’m using ArviZ 0.7.0 and they definitely don’t match. Here is the same table ordered by LOO:

index         rank  loo       p_loo    d_loo    weight       se       dse      warning  loo_scale
Null          3     -1136.47  47.9011  103.986  0.087652     26.6238  18.8582  0        log
Death only    2     -1077.52  62.0865  45.0306  1.06858e-05  21.1415  11.0959  1        log
Decay only    1     -1039.55  53.1845  7.06546  0.127025     21.1935  6.6104   1        log
Full          0     -1032.49  58.5355  0        0.785312     19.6743  0        1        log

And again by weight:

index         rank  loo       p_loo    d_loo    weight       se       dse      warning  loo_scale
Death only    2     -1077.52  62.0865  45.0306  1.06858e-05  21.1415  11.0959  1        log
Null          3     -1136.47  47.9011  103.986  0.087652     26.6238  18.8582  0        log
Decay only    1     -1039.55  53.1845  7.06546  0.127025     21.1935  6.6104   1        log
Full          0     -1032.49  58.5355  0        0.785312     19.6743  0        1        log

I’m not sure I’m following you :thinking:

  • From what I see, only two of your models have a substantial weight: Decay only and Full.
  • If I’m not mistaken, on the log scale, the higher the LOO, the better. So, based on LOO, the best model would be Full, followed by Decay only, right?
  • Which means that Full should have the most weight, with Decay only next. That seems to be the case, doesn’t it? 0.79 for Full and 0.13 for Decay only.

So, I’m not seeing any mismatch here – am I missing something (which is clearly possible: I’m no IC expert!)?

The rankings don’t match:

According to LOO it’s Full > Decay > Death > Null.

According to weight it’s Full > Decay > Null > Death.

Stacking weights do not necessarily follow the same order as LOO/WAIC values, nor as pseudo-BMA weights.

There is a very clear example of this in the loo R package vignettes, in the “Example: Oceanic tool complexity” section. The results are the following:

       waic_wts pbma_wts pbma_BB_wts stacking_wts
model1     0.40     0.36        0.30         0.00
model2     0.56     0.62        0.53         0.78
model3     0.04     0.02        0.17         0.22

where waic_wts are the weights obtained by normalizing the WAIC values over all 3 models, $waic\_wts_i = \frac{\exp(waic_i)}{\sum_j \exp(waic_j)}$, and the other 3 columns are pseudo-BMA, BB-pseudo-BMA and stacking weights. It can be seen that the order according to the first three columns is model2 > model1 > model3, whereas the stacking weights order them as model2 > model3 > model1.
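
As a quick numerical sketch of that normalization, here it is applied to the elpd values from your table (these would be the plain pseudo-BMA-style weights, before any Bayesian bootstrap or stacking):

```python
import numpy as np

# elpd_loo values (log scale) from the compare table: Full, Decay, Death, Null
elpd = np.array([-1032.49, -1039.55, -1077.52, -1136.47])

# Softmax normalization; subtracting the max first avoids numerical underflow
z = elpd - elpd.max()
weights = np.exp(z) / np.exp(z).sum()
print(weights.round(6))  # nearly all weight on Full, unlike the stacking weights
```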

The intuition behind this phenomenon is explained in the same vignette (emphasis mine):

All weights favor the second model with the log population and the contact rate. WAIC weights and Pseudo-BMA weights (without Bayesian bootstrap) are similar, while Pseudo-BMA+ is more cautious and closer to stacking weights.

It may seem surprising that Bayesian stacking is giving zero weight to the first model, but this is likely due to the fact that the estimated effect for the interaction term is close to zero and thus models 1 and 2 give very similar predictions. In other words, incorporating the model with the interaction (model 1) into the model average doesn’t improve the predictions at all and so model 1 is given a weight of 0. On the other hand, models 2 and 3 are giving slightly different predictions and thus their combination may be slightly better than either alone.
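
For completeness, this is why a model can get zero stacking weight even when its individual LOO looks fine: stacking chooses all the weights jointly, maximizing the combined leave-one-out log predictive density (this is the objective from the paper linked below):

$$\max_{w} \frac{1}{N} \sum_{n=1}^{N} \log \sum_{k=1}^{K} w_k \, p(y_n \mid y_{-n}, M_k), \qquad \text{subject to } w_k \ge 0, \ \sum_{k=1}^{K} w_k = 1$$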

See also: [1704.02030] Using stacking to average Bayesian predictive distributions


Super clear – and enlightening, thank you Oriol!

This is perfect, thank you.

I guess I didn’t appreciate the difference in intent between stacking and simple model comparison using LOO.

The paper seems to imply that stacking should be preferred to pseudo-BMA weights. Is this the case? I can see that the default for ArviZ is pseudo-BMA weights.

Yes, the paper (and others by the same group) says so outright. I don’t know why ArviZ prefers pseudo-BMA.

Probably just legacy reasons.

We (ArviZ) just changed the default for the compare function from waic to loo, so we might also change other defaults if needed.
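
In the meantime, you can always pick the method explicitly instead of relying on the default (assuming model_dict maps names to traces, as in the original post):

```python
import arviz as az

comp_stacking = az.compare(model_dict, method="stacking")   # stacking weights
comp_pbma = az.compare(model_dict, method="BB-pseudo-BMA")  # current default
```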
