I recently compared some models using `pm.compare`. When I use “stacking”, the rankings by LOO don’t match those by weight:

| index | model | rank | loo | p_loo | d_loo | weight | se | dse | warning | loo_scale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Full | 0 | 1032.49 | 58.5355 | 0 | 0.785312 | 19.6743 | 0 | 1 | log |
| 1 | Decay only | 1 | 1039.55 | 53.1845 | 7.06546 | 0.127025 | 21.1935 | 6.6104 | 1 | log |
| 2 | Death only | 2 | 1077.52 | 62.0865 | 45.0306 | 1.06858e-05 | 21.1415 | 11.0959 | 1 | log |
| 3 | Null | 3 | 1136.47 | 47.9011 | 103.986 | 0.087652 | 26.6238 | 18.8582 | 0 | log |
Has anyone seen this happen before? Perhaps it has something to do with the warnings?
Hi Sam!
Which version of ArviZ are you using?
In the most recent (0.7.0), the default scale is now log (as in your picture), which means that the higher the LOO, the better the model. So, in that case, I think the LOO rankings match the rankings by weights.
I’m using ArviZ 0.7.0 and they definitely don’t match. Here is the same table ordered by LOO:

| index | model | rank | loo | p_loo | d_loo | weight | se | dse | warning | loo_scale |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Null | 3 | 1136.47 | 47.9011 | 103.986 | 0.087652 | 26.6238 | 18.8582 | 0 | log |
| 2 | Death only | 2 | 1077.52 | 62.0865 | 45.0306 | 1.06858e-05 | 21.1415 | 11.0959 | 1 | log |
| 1 | Decay only | 1 | 1039.55 | 53.1845 | 7.06546 | 0.127025 | 21.1935 | 6.6104 | 1 | log |
| 0 | Full | 0 | 1032.49 | 58.5355 | 0 | 0.785312 | 19.6743 | 0 | 1 | log |
And again by weight:

| index | model | rank | loo | p_loo | d_loo | weight | se | dse | warning | loo_scale |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Death only | 2 | 1077.52 | 62.0865 | 45.0306 | 1.06858e-05 | 21.1415 | 11.0959 | 1 | log |
| 3 | Null | 3 | 1136.47 | 47.9011 | 103.986 | 0.087652 | 26.6238 | 18.8582 | 0 | log |
| 1 | Decay only | 1 | 1039.55 | 53.1845 | 7.06546 | 0.127025 | 21.1935 | 6.6104 | 1 | log |
| 0 | Full | 0 | 1032.49 | 58.5355 | 0 | 0.785312 | 19.6743 | 0 | 1 | log |
I’m not sure I’m following you. From what I see, only two of your models have a substantial weight: Decay only and Full.
If I’m not mistaken, on the log scale, the higher the LOO, the better. So, based on LOO, the best model would be Full, followed by Decay only, right? Which means that Full should have the most weight, and Decay only should follow. This seems to be the case, doesn’t it? 0.79 for Full, and 0.13 for Decay only.
So, I’m not seeing any mismatch here – am I missing something (which is clearly possible: I’m no IC expert!)?
The rankings don’t match:
According to LOO it’s Full > Decay > Death > Null.
According to weight it’s Full > Decay > Null > Death.
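To make the mismatch concrete, here is a quick pure-Python check using the `rank` and `weight` columns from the table above (reading the Death only weight as 1.06858e-05, so that the four weights sum to roughly one):

```python
# rank and weight values copied from the comparison table above.
rank = {"Full": 0, "Decay only": 1, "Death only": 2, "Null": 3}
weight = {
    "Full": 0.785312,
    "Decay only": 0.127025,
    "Death only": 1.06858e-05,  # read as 1.06858e-05 so the weights sum to ~1
    "Null": 0.087652,
}

# Order models by LOO rank (best first) and by weight (largest first).
by_rank = sorted(rank, key=rank.get)
by_weight = sorted(weight, key=weight.get, reverse=True)

print(by_rank)    # ['Full', 'Decay only', 'Death only', 'Null']
print(by_weight)  # ['Full', 'Decay only', 'Null', 'Death only']
```

The last two entries swap between the two orderings.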
Stacking weights do not necessarily follow the same order as the LOO/WAIC order, nor as pseudo-BMA weights.
There is a very clear example of this in the loo R package vignettes, in the “Example: Oceanic tool complexity” section. The results are the following:
```
         waic_wts  pbma_wts  pbma_BB_wts  stacking_wts
model1   0.40      0.36      0.30         0.00
model2   0.56      0.62      0.53         0.78
model3   0.04      0.02      0.17         0.22
```
where `waic_wts` are the weights obtained by normalizing WAIC over all 3 models (waic\_wts_i = \frac{\exp (waic_i)}{\sum_j \exp (waic_j)}), and the other 3 columns are pseudo-BMA, BB-pseudo-BMA, and stacking weights. It can be seen that the order according to the first three columns is model2 > model1 > model3, whereas stacking weights order them as model2 > model3 > model1.
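That normalization is just a softmax over the criterion values. A minimal pure-Python sketch, using made-up elpd values (illustrative only, not the vignette’s actual numbers):

```python
import math

# Hypothetical elpd values on the log scale (higher is better); these are
# made up for illustration, not taken from the loo vignette.
elpd = {"model1": -80.2, "model2": -79.1, "model3": -83.5}

# w_i = exp(elpd_i) / sum_j exp(elpd_j); subtract the max for numerical stability.
m = max(elpd.values())
unnorm = {k: math.exp(v - m) for k, v in elpd.items()}
total = sum(unnorm.values())
weights = {k: u / total for k, u in unnorm.items()}
```

Weights computed this way always sum to one and always preserve the elpd ordering, which is exactly why they can disagree with stacking weights: stacking optimizes the joint predictive performance of the mixture instead of scoring each model separately.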
The intuition behind this phenomenon is explained in the same vignette (emphasis mine):
All weights favor the second model with the log population and the contact rate. WAIC weights and PseudoBMA weights (without Bayesian bootstrap) are similar, while PseudoBMA+ is more cautious and closer to stacking weights.
It may seem surprising that Bayesian stacking is giving zero weight to the first model, but this is likely due to the fact that the estimated effect for the interaction term is close to zero and thus models 1 and 2 give very similar predictions. In other words, incorporating the model with the interaction (model 1) into the model average doesn’t improve the predictions at all and so model 1 is given a weight of 0. On the other hand, models 2 and 3 are giving slightly different predictions and thus their combination may be slightly better than either alone.
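The “combination may be slightly better than either alone” point can be illustrated with a toy stacking objective. This sketch uses made-up pointwise predictive densities for two models (all numbers hypothetical): each model predicts some observations well and others poorly, so an interior mixture weight beats both endpoints:

```python
import math

# Made-up pointwise predictive densities p(y_n | M_k) for 5 observations.
# Model A is good on the first two points, model C on the middle ones.
p_A = [0.9, 0.8, 0.1, 0.2, 0.7]
p_C = [0.2, 0.3, 0.9, 0.8, 0.3]

def stacking_objective(w):
    """Mean log predictive density of the mixture w * A + (1 - w) * C."""
    return sum(math.log(w * a + (1 - w) * c) for a, c in zip(p_A, p_C)) / len(p_A)

# Crude grid search over the single mixture weight.
best_w = max((i / 1000 for i in range(1001)), key=stacking_objective)

# The optimal mixture strictly beats using either model on its own.
assert stacking_objective(best_w) > max(stacking_objective(0.0), stacking_objective(1.0))
```

If model A instead made (nearly) the same predictions as model C, adding it to the mixture would not improve the objective at all, and it could get a weight of zero — the situation described in the quote above.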
See also: [1704.02030] Using stacking to average Bayesian predictive distributions
Super clear – and enlightening, thank you Oriol!
This is perfect, thank you.
I guess I didn’t appreciate the difference in intentions of stacking and simple model comparison using LOO.
The paper seems to imply that stacking should be preferred to pseudo-BMA weights. Is this the case? I can see that the default for ArviZ is pseudo-BMA weights.
Yes, the paper (and others by the same group) says that outright. I don’t know why pseudo-BMA is preferred by ArviZ.
Probably just legacy reasons.
We (ArviZ) just changed the default for the `compare` function from `waic` to `loo`, so we might also change other defaults if needed.