I have built a few models using pymc3 and I am comparing them with arviz.compare to find the best one.
Depending on the information criterion I use (either WAIC or PSIS-LOO), I get the same ranking, but the IC standard errors differ a lot.
In particular, with WAIC the results single out one best model: the model WAICs are far apart from one another relative to their standard errors. With PSIS-LOO, however, although the model ranking stays the same (the ranking is based on the IC values), the standard errors of the ICs suggest that the differences are not significant, so the model selection does not look reliable.
arviz.compare also provides a metric called "weight", which the documentation says "can be loosely interpreted as the probability of each model given the data" (see the arviz.compare page in the ArviZ dev documentation).
I have two questions about the weight metric:
- I have been looking at the code, but I am not sure I understand exactly what it is doing. Can anyone point me to a reference for this specific computation?
- Is it correct to use this measure to decide whether a model is "sufficiently" better than another one, e.g. by imposing a minimum value for the best model's weight (weight >= 0.99)?
Thanks a lot for your help!