I recently compared some models using `pm.compare`. When I use “stacking”, the rankings by LOO don’t match those by weight:
| | index | rank | loo | p_loo | d_loo | weight | se | dse | warning | loo_scale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Full | 0 | -1032.49 | 58.5355 | 0 | 0.785312 | 19.6743 | 0 | 1 | log |
| 1 | Decay only | 1 | -1039.55 | 53.1845 | 7.06546 | 0.127025 | 21.1935 | 6.6104 | 1 | log |
| 2 | Death only | 2 | -1077.52 | 62.0865 | 45.0306 | 1.06858e-05 | 21.1415 | 11.0959 | 1 | log |
| 3 | Null | 3 | -1136.47 | 47.9011 | 103.986 | 0.087652 | 26.6238 | 18.8582 | 0 | log |
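For reference, here is roughly how a table like this can be produced. The data and models below are made-up toy stand-ins (the real “Full”, “Decay only”, etc. models aren’t shown in this thread), and in recent PyMC3 versions `pm.compare` is backed by ArviZ’s `compare`:

```python
import numpy as np
import pymc3 as pm
import arviz as az

# Toy data and two toy models standing in for the real candidates.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + rng.normal(0, 0.5, size=x.size)

with pm.Model() as full_model:
    a = pm.Normal("a", 0, 10)
    b = pm.Normal("b", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("obs", a + b * x, sigma, observed=y)
    idata_full = pm.sample(1000, tune=1000, return_inferencedata=True)

with pm.Model() as null_model:
    a = pm.Normal("a", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("obs", a, sigma, observed=y)
    idata_null = pm.sample(1000, tune=1000, return_inferencedata=True)

# LOO on the log scale (the ArviZ >= 0.7.0 default) with stacking weights.
cmp = az.compare({"Full": idata_full, "Null": idata_null},
                 ic="loo", method="stacking", scale="log")
print(cmp)
```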
Has anyone seen this happen before? Perhaps it has something to do with the warnings?
Hi Sam!
Which version of ArviZ are you using?
In the most recent (0.7.0), the default scale is now log (as in your picture), which means that the higher the LOO, the better the model. So, in that case, I think the LOO rankings match the rankings by weights.
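As a quick illustration of the scale convention (using one of ArviZ’s bundled example datasets rather than the models in this thread):

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")  # bundled example with a log_likelihood group

loo_log = az.loo(idata, scale="log")       # elpd scale: higher values are better
loo_dev = az.loo(idata, scale="deviance")  # deviance scale (-2 * elpd): lower is better
print(loo_log)
print(loo_dev)
```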
I’m using ArviZ 0.7.0 and they definitely don’t match. Here is the same table ordered by LOO:
| | index | rank | loo | p_loo | d_loo | weight | se | dse | warning | loo_scale |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Null | 3 | -1136.47 | 47.9011 | 103.986 | 0.087652 | 26.6238 | 18.8582 | 0 | log |
| 2 | Death only | 2 | -1077.52 | 62.0865 | 45.0306 | 1.06858e-05 | 21.1415 | 11.0959 | 1 | log |
| 1 | Decay only | 1 | -1039.55 | 53.1845 | 7.06546 | 0.127025 | 21.1935 | 6.6104 | 1 | log |
| 0 | Full | 0 | -1032.49 | 58.5355 | 0 | 0.785312 | 19.6743 | 0 | 1 | log |
And again by weight:
| | index | rank | loo | p_loo | d_loo | weight | se | dse | warning | loo_scale |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Death only | 2 | -1077.52 | 62.0865 | 45.0306 | 1.06858e-05 | 21.1415 | 11.0959 | 1 | log |
| 3 | Null | 3 | -1136.47 | 47.9011 | 103.986 | 0.087652 | 26.6238 | 18.8582 | 0 | log |
| 1 | Decay only | 1 | -1039.55 | 53.1845 | 7.06546 | 0.127025 | 21.1935 | 6.6104 | 1 | log |
| 0 | Full | 0 | -1032.49 | 58.5355 | 0 | 0.785312 | 19.6743 | 0 | 1 | log |
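(Both orderings are just sorts of the compare DataFrame; a minimal sketch, assuming `cmp` holds the table returned by `pm.compare`:)

```python
print(cmp.sort_values("loo"))     # ascending elpd_loo: worst model first
print(cmp.sort_values("weight"))  # ascending stacking weight
```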
I’m not sure I’m following you:
- From what I see, only two of your models have a substantial weight: Decay only and Full.
- If I’m not mistaken, on the log scale, the higher the LOO, the better. So, based on LOO, the best model would be Full, followed by Decay only, right?
- Which means that Full should have the most weight, and Decay only should follow. This seems to be the case, doesn’t it? 0.79 for Full, and 0.13 for Decay only.

So, I’m not seeing any mismatch here – am I missing something (which is clearly possible: I’m no IC expert!)?
The rankings don’t match:
According to LOO it’s Full > Decay only > Death only > Null.
According to weight it’s Full > Decay only > Null > Death only.
Stacking weights do not necessarily follow the same order as the LOO/WAIC ranking, nor the pseudo-BMA weights.
There is a very clear example of this in the `loo` R package vignettes, in the “Example: Oceanic tool complexity” section. The results are the following:
| | waic_wts | pbma_wts | pbma_BB_wts | stacking_wts |
|---|---|---|---|---|
| model1 | 0.40 | 0.36 | 0.30 | 0.00 |
| model2 | 0.56 | 0.62 | 0.53 | 0.78 |
| model3 | 0.04 | 0.02 | 0.17 | 0.22 |
where `waic_wts` are the weights obtained by normalizing the WAIC over all 3 models, $\text{waic\_wts}_i = \frac{\exp(\text{waic}_i)}{\sum_j \exp(\text{waic}_j)}$, and the other 3 columns are pseudo-BMA, BB-pseudo-BMA, and stacking weights. It can be seen that the order according to the first three columns is model2 > model1 > model3, whereas the stacking weights order them as model2 > model3 > model1.
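To make the normalization concrete, here is a small sketch that reproduces weights close to the `waic_wts` column above. The elpd numbers are made up for illustration (chosen only so the result lands near that column), not the vignette’s actual estimates:

```python
import numpy as np

# Made-up elpd_waic values for three models.
elpd = np.array([-40.04, -39.70, -42.34])

# waic_wts_i = exp(elpd_i) / sum_j exp(elpd_j), computed stably by subtracting
# the maximum before exponentiating (a softmax over the elpd values).
w = np.exp(elpd - elpd.max())
waic_wts = w / w.sum()
print(waic_wts.round(2))  # approximately [0.40, 0.56, 0.04]
```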
The intuition behind this phenomenon is explained in the same vignette (emphasis mine):
> All weights favor the second model with the log population and the contact rate. WAIC weights and Pseudo-BMA weights (without Bayesian bootstrap) are similar, while Pseudo-BMA+ is more cautious and closer to stacking weights.
>
> It may seem surprising that Bayesian stacking is giving zero weight to the first model, but this is likely due to the fact that the estimated effect for the interaction term is close to zero and thus models 1 and 2 give very similar predictions. In other words, incorporating the model with the interaction (model 1) into the model average doesn’t improve the predictions at all and so model 1 is given a weight of 0. On the other hand, models 2 and 3 are giving slightly different predictions and thus their combination may be slightly better than either alone.
See also: [1704.02030] Using stacking to average Bayesian predictive distributions
Super clear – and enlightening, thank you Oriol!
This is perfect, thank you.
I guess I didn’t appreciate the difference in intent between stacking and simple model comparison using LOO.
The paper seems to imply that stacking should be preferred to pseudo-BMA weights. Is this the case? I can see that the default for ArviZ is pseudo-BMA weights.
Yes, the paper (and others by the same group) outright say that. I don’t know why pseudo-BMA is preferred by ArviZ.
Probably just legacy reasons.
We (ArviZ) just changed the default for the `compare` function from `waic` to `loo`, so we might also change other defaults if needed.
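For what it’s worth, whichever default ends up being used, the weighting method can always be requested explicitly. A small sketch using ArviZ’s bundled example models in place of real candidates:

```python
import arviz as az

# Two bundled example models stand in for real candidates here.
model_dict = {
    "centered": az.load_arviz_data("centered_eight"),
    "non_centered": az.load_arviz_data("non_centered_eight"),
}

print(az.compare(model_dict, ic="loo", method="stacking")["weight"])
print(az.compare(model_dict, ic="loo", method="BB-pseudo-BMA")["weight"])
print(az.compare(model_dict, ic="loo", method="pseudo-BMA")["weight"])
```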