Bayesian model averaging: ranking of model weights and LOO don't match

Stacking weights do not necessarily follow the same ordering as the loo/waic values, nor the pseudo-BMA weights.

There is a very clear example of this in the loo R package vignettes, in the "Example: Oceanic tool complexity" section. The results are the following:

       waic_wts pbma_wts pbma_BB_wts stacking_wts
model1     0.40     0.36        0.30         0.00
model2     0.56     0.62        0.53         0.78
model3     0.04     0.02        0.17         0.22

where waic_wts are the weights obtained by normalizing the exponentiated waic estimates over all 3 models, $waic\_wts_i = \frac{\exp(waic_i)}{\sum_j \exp(waic_j)}$, and the other three columns are pseudo-BMA, BB-pseudo-BMA (pseudo-BMA+) and stacking weights. The ranking according to the first three columns is model2 > model1 > model3, whereas the stacking weights rank them as model2 > model3 > model1.
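To make the normalization in the formula above concrete, here is a minimal Python sketch. The elpd numbers are made up for illustration (chosen only so the resulting weights land in the same ballpark as the table), not the vignette's actual values:

```python
import numpy as np

# Hypothetical elpd estimates for three models (illustrative numbers,
# not the ones from the loo vignette).
elpd = np.array([-40.2, -39.9, -42.6])

# Normalize exp(elpd) over the models, i.e. a softmax of the elpd values.
# Subtracting the max first avoids numerical overflow/underflow.
log_w = elpd - elpd.max()
waic_wts = np.exp(log_w) / np.exp(log_w).sum()

print(waic_wts.round(2))  # → [0.41 0.55 0.04]
```

Note that these weights depend only on each model's own (estimated) predictive accuracy, not on how the models' predictions relate to each other, which is why they can rank models differently than stacking does.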

The intuition behind this phenomenon is explained in the same vignette:

> All weights favor the second model with the log population and the contact rate. WAIC weights and Pseudo-BMA weights (without Bayesian bootstrap) are similar, while Pseudo-BMA+ is more cautious and closer to stacking weights.
>
> It may seem surprising that Bayesian stacking is giving zero weight to the first model, but this is likely due to the fact that the estimated effect for the interaction term is close to zero and thus models 1 and 2 give very similar predictions. In other words, incorporating the model with the interaction (model 1) into the model average doesn’t improve the predictions at all and so model 1 is given a weight of 0. On the other hand, models 2 and 3 are giving slightly different predictions and thus their combination may be slightly better than either alone.
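The stacking weights come from a different criterion: they maximize the leave-one-out log score of the weighted *mixture* of predictive distributions (Yao et al., 2018), so a model that duplicates another's predictions adds nothing to the mixture and can get weight 0. A minimal Python sketch of that objective, using made-up predictive densities that mimic the "two nearly identical models plus one different one" situation (all values here are synthetic, for illustration only):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical LOO predictive densities p_k(y_n) for N points and K=3 models.
# Models 1 and 2 are nearly identical; model 3 differs, mimicking the
# situation described in the vignette quote above.
N = 200
base = rng.uniform(0.05, 0.6, size=N)
lpd = np.column_stack([
    base * rng.uniform(0.98, 1.02, N),  # model 1
    base * rng.uniform(0.99, 1.01, N),  # model 2 (≈ model 1)
    base * rng.uniform(0.5, 1.5, N),    # model 3 (genuinely different)
])

def neg_log_score(z):
    # Softmax parametrization keeps the weights on the simplex.
    w = np.exp(z - z.max())
    w /= w.sum()
    # Stacking objective: sum_n log( sum_k w_k * p_k(y_n) ), negated.
    return -np.log(lpd @ w).sum()

res = minimize(neg_log_score, np.zeros(3), method="Nelder-Mead")
w = np.exp(res.x - res.x.max())
w /= w.sum()
print(w.round(2))
```

Because models 1 and 2 are near-duplicates here, the objective is almost flat in how weight is split between them; only their combined weight is well identified, which is the mechanism behind one of them being driven to zero in the real example.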

See also: [1704.02030] Using stacking to average Bayesian predictive distributions
