Comparing models and selecting the best one with the "weight" of `compare`

Hi everybody,

I have built a few models using pymc3 and I am comparing them to find the best one using ArviZ's `compare` function.
Whichever information criterion I use (either WAIC or PSIS-LOO) I get the same ranking, but the IC standard errors differ a lot.

In particular, if I use WAIC, the results highlight one best model: the model WAICs are far apart from one another relative to the IC standard errors. However, if I use PSIS-LOO, although the model ranking (based on the IC values) remains the same, the standard errors of the ICs seem to imply that the ranking is not significant and thus that the model selection is not reliable. `compare` also provides a metric called "weight", which the documentation says "can be loosely interpreted as the probability of each model given the data" (see the `compare` entry in the ArviZ dev documentation).
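For context, this is roughly how I am running the comparison. The sketch below uses ArviZ's built-in eight-schools posteriors as stand-ins for my actual models:

```python
import arviz as az

# Stand-ins for my fitted pymc3 models: any InferenceData objects
# with log-likelihood values would work here.
models = {
    "centered": az.load_arviz_data("centered_eight"),
    "non_centered": az.load_arviz_data("non_centered_eight"),
}

cmp_waic = az.compare(models, ic="waic")  # rank by WAIC
cmp_loo = az.compare(models, ic="loo")    # rank by PSIS-LOO

# Each result is a DataFrame with the IC estimate, its standard
# error, and the "weight" column I ask about below.
print(cmp_waic)
print(cmp_loo)
```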

I have two questions about the weight metric:

  1. I have been looking at the code, but I am not sure I understand exactly what it is doing. Can anyone point me to a reference for this specific operation?
  2. Is it correct to use this measure to decide whether a model is "sufficiently" better than another one, e.g. by imposing a minimum value on the best model's weight (weight >= 0.99)? The snippet after this list shows what I mean.
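To make question 2 concrete, the rule I have in mind is something like this (using the `cmp_loo` DataFrame from the snippet above; the 0.99 cutoff is just a placeholder):

```python
# compare() returns the models sorted by rank, so row 0 is the
# top-ranked model; accept it only if it carries almost all weight.
best_weight = cmp_loo["weight"].iloc[0]
if best_weight >= 0.99:
    print("select the top-ranked model")
else:
    print("no clearly dominant model")
```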

thanks a lot for your help


Interpretation of the standard errors is still a bit of an open question; see, for example, [2008.10296] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison.

The main reference for that is probably the paper linked in the docstring (which we should fix and format as a proper reference): [1704.02030] Using stacking to average Bayesian predictive distributions
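In short, stacking picks the simplex weights that maximize the log score of the weighted combination of the models' leave-one-out predictive densities. Here is a toy sketch of that objective (my own simplification, not ArviZ's actual implementation, which estimates the pointwise densities with PSIS-LOO):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def stacking_weights(loo_lpd):
    """Toy version of the stacking objective from Yao et al.

    loo_lpd: array of shape (n_observations, n_models) holding the
    leave-one-out log pointwise predictive density of each model.
    """
    n_models = loo_lpd.shape[1]

    def neg_log_score(theta):
        # Softmax keeps the weights on the simplex (w_k >= 0, sum = 1);
        # one weight is pinned to theta = 0 for identifiability.
        w = softmax(np.concatenate([theta, [0.0]]))
        # Log score of the weighted mixture of predictive densities:
        # sum_i log( sum_k w_k * p_k(y_i | y_{-i}) ).
        return -np.sum(logsumexp(loo_lpd + np.log(w), axis=1))

    res = minimize(neg_log_score, np.zeros(n_models - 1), method="BFGS")
    return softmax(np.concatenate([res.x, [0.0]]))

# Example: fake LOO log densities for 3 models on 50 observations.
rng = np.random.default_rng(0)
fake_lpd = rng.normal(loc=-1.0, scale=0.3, size=(50, 3))
print(stacking_weights(fake_lpd))  # weights sum to 1
```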