Problem in model comparison

Hello everyone. I am using PyMC for Bayesian model averaging (BMA). Everything works well until I get to model comparison, where the PyMC documentation for BMA notes that:
1. rank, the ranking of the models starting from 0 (best model) to the number of models.
2. loo, the values of LOO (or WAIC). The DataFrame is always sorted from best LOO/WAIC to worst.
3. p_loo, the value of the penalization term. We can roughly think of this value as the estimated effective number of parameters (but do not take that too seriously).
4. d_loo, the relative difference between the value of LOO/WAIC for the top-ranked model and the value of LOO/WAIC for each model. For this reason we will always get a value of 0 for the first model.
5. weight, the weights assigned to each model. These weights can be loosely interpreted as the probability of each model being true (among the compared models) given the data.
6. se, the standard error for the LOO/WAIC computations. The standard error can be useful to assess the uncertainty of the LOO/WAIC estimates. By default these errors are computed using stacking.
7. dse, the standard errors of the difference between two values of LOO/WAIC. The same way that we can compute the standard error for each value of LOO/WAIC, we can compute the standard error of the differences between two values of LOO/WAIC. Notice that both quantities are not necessarily the same, the reason is that the uncertainty about LOO/WAIC is correlated between models. This quantity is always 0 for the top-ranked model.
8. warning, If True the computation of LOO/WAIC may not be reliable.
9. loo_scale, the scale of the reported values. The default is the log scale as previously mentioned. Other options are deviance – this is the log-score multiplied by -2 (this reverts the order: a lower LOO/WAIC will be better) – and negative-log – this is the log-score multiplied by -1 (as with the deviance scale, a lower value is better).
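For reference, this is roughly how I produce and read that table (the model and idata names below are placeholders for my actual models):

import arviz as az

# comp is a pandas DataFrame with the columns listed above;
# idata1/idata2/idata3 stand in for the inference data of my actual models
comp = az.compare({"model1": idata1, "model2": idata2, "model3": idata3})

print(comp)            # rows are sorted from best to worst rank
print(comp["weight"])  # the weights do not necessarily follow that order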

Looking at these definitions, the weights vary in a different order than the rank and loo, and the warnings are also all True. Is there any problem here? If yes, could anyone please help me? Thanks

They do not always have the same order; there is some discussion about it here:

Thank you for your response. But in the warning column the values are all True, which means that the computation of LOO/WAIC may not be reliable. Concerning the weights, how can the best model not have the highest weight (probability of being true)?

Regarding the weights, my understanding of the situation is: if you have two models that perform similarly, then one of the models is sufficient to explain the data. So in a situation where you are trying to explain your observed data by taking a weighted sum of your models, one of the models, albeit performing well, does not bring anything new to the table. Therefore you may just as well explain your data by giving all the weight to one of the models. The link I have given above and the articles referenced therein discuss this in more detail. I might also suggest this
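To make that concrete, here is a small sketch of my own (not from the docs; preds and weights are placeholders) of what taking a weighted sum of models can look like in practice:

import numpy as np

# Mix predictions from several models by drawing from each model's posterior
# predictive samples in proportion to its weight. `preds` is a placeholder for
# arrays you would extract from your own fitted models.
def mix_predictions(preds, weights, n_draws=4000, seed=0):
    """preds: list of arrays of shape (n_samples_i, n_obs); weights: nonnegative."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # pick a model for each mixed draw, then one of that model's samples
    model_idx = rng.choice(len(preds), size=n_draws, p=weights)
    return np.stack([preds[m][rng.integers(preds[m].shape[0])] for m in model_idx])

With weights like (1.0, 0.0) for two near-identical models the mixture is simply the first model's predictions, so nothing is lost by ignoring the other one.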

You can try a test where you sample the identical model twice and try model comparison (below, model1 and model3 are identical; model2 has an extra unneeded parameter):


import pymc as pm
import numpy as np
import arviz as az

seed=0
rng = np.random.default_rng(seed)

# two covariates sampled uniformly on [-1, 1]
x = 2*rng.random(100) - 1
y = 2*rng.random(100) - 1

a_real = 0.4
b_real = -2
c_real = 3
d_real = 0.01

# noisy observations from a surface with a very weak interaction term
z_real = a_real + b_real*x + c_real*y + d_real*x*y
z_obs = z_real + rng.normal(0, 0.1, size=z_real.shape)


# model1: linear in x and y, no interaction term
with pm.Model() as model1:

  a = pm.Normal("a", 0, 1)
  b = pm.Normal("b", 0, 1)
  c = pm.Normal("c", 0, 1)
  sd = pm.HalfNormal("sd", 1)

  pm.Normal("likelihood", a + b*x + c*y, sd, observed=z_obs)

  idata1 = pm.sample(2000, tune=1000, chains=6, random_seed=rng)

with model1:
  pm.compute_log_likelihood(idata1)


# model2: same as model1 plus an (unneeded) interaction term d*x*y
with pm.Model() as model2:

  a = pm.Normal("a", 0, 1)
  b = pm.Normal("b", 0, 1)
  c = pm.Normal("c", 0, 1)
  d = pm.Normal("d", 0, 1)

  sd = pm.HalfNormal("sd", 1)

  pm.Normal("likelihood",  a + b*x + c*y + d*x*y, sd, observed=z_obs)

  idata2 = pm.sample(2000, tune=1000, chains=6, random_seed=rng)

with model2:
  pm.compute_log_likelihood(idata2)


# model3: identical to model1, sampled independently
with pm.Model() as model3:

  a = pm.Normal("a", 0, 1)
  b = pm.Normal("b", 0, 1)
  c = pm.Normal("c", 0, 1)
  sd = pm.HalfNormal("sd", 1)

  pm.Normal("likelihood", a + b*x + c*y, sd, observed=z_obs)

  idata3 = pm.sample(2000, tune=1000, chains=6, random_seed=rng)

with model3:
  pm.compute_log_likelihood(idata3)

# compare the three models (LOO by default)
comp = az.compare({"model3": idata3, "model2": idata2, "model1": idata1})
print(comp.to_string())


The result I get from this with seed=0 is:

        rank   elpd_loo     p_loo  elpd_diff        weight        se       dse  warning scale
model1     0  74.107390  4.373432   0.000000  1.000000e+00  8.389269  0.000000    False   log
model3     1  74.096669  4.373302   0.010720  3.330669e-16  8.376381  0.026016    False   log
model2     2  73.259166  5.240401   0.848224  0.000000e+00  8.218600  0.567958    False   log

You can see that model1 and model3, despite being identical and having very similar loo values, have weights 1 and 0. (I threw in a small interaction in the data, and a model with the interaction term, just for fun.)

Another run with seed=1 gives

        rank   elpd_loo     p_loo  elpd_diff        weight        se       dse  warning scale
model3     0  94.797507  4.423737   0.000000  5.862351e-01  9.390746  0.000000    False   log
model2     1  94.718390  5.043274   0.079117  4.137649e-01  9.542755  0.936753    False   log
model1     2  94.688634  4.542713   0.108874  4.440892e-16  9.438960  0.056610    False   log

So in a singular situation like this, where there is a lot of uncertainty in the weights of two models (because they perform similarly), what you get for the weights will depend on your samples and chains.
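If you want weights that reflect that uncertainty more directly, one thing you could try (a sketch, assuming an ArviZ version where az.compare takes a method argument) is Bayesian-bootstrap pseudo-BMA weights, which tend to be spread more evenly across near-identical models than the default stacking weights:

comp_bb = az.compare(
    {"model3": idata3, "model2": idata2, "model1": idata1},
    method="BB-pseudo-BMA",  # Bayesian bootstrap instead of the default "stacking"
)
print(comp_bb.to_string())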

As for the warnings, I am afraid I don’t know much about those. One possibility is not enough data, though. If you run the code above with 10 data points (instead of 100) and seed=1, here is what I get:

        rank  elpd_loo     p_loo  elpd_diff        weight        se       dse  warning scale
model3     0  2.474919  3.038526   0.000000  1.000000e+00  1.445940  0.000000    False   log
model1     1  2.332299  3.152023   0.142620  1.665335e-16  1.437265  0.024676    False   log
model2     2  0.627785  4.384493   1.847134  0.000000e+00  1.762000  1.399859     True   log

So it gives a warning for model2, which has more terms to estimate, so perhaps 10 data points are not enough. You can also see that with fewer data points (how many counts as few will depend on the number of parameters your model has) the difference in loo between identical models can be as high as 0.1. Running this with seed=2 gives a warning for all of the models above.
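If you want to dig into what triggers the warning, here is a small sketch (reusing idata2 from the code above) of inspecting the Pareto-k diagnostics that PSIS-LOO reports; the warning flag usually comes from high k values:

# pointwise=True keeps the per-observation Pareto-k values
loo2 = az.loo(idata2, pointwise=True)
print(loo2)          # summary, including how many observations have high k
az.plot_khat(loo2)   # per-observation k values; values above ~0.7 are problematic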

PS: I do realize that these do not exactly reproduce the example you have seen, in which weights and loo have different orders (although conceptually it does show that weights do not always align with how good the model is). For that I can suggest using each of your models, one by one, as the data generator and doing tests similar to the ones here on the simulated data. You can then estimate, for instance, the effect of sampling on the results you are observing, and so on.
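In case it is useful, a rough sketch of that idea reusing model1 and idata1 from the code above (the remaining steps are described in comments rather than spelled out):

# draw simulated data sets from model1's posterior predictive
with model1:
    ppc = pm.sample_posterior_predictive(idata1, random_seed=rng)

# one posterior predictive draw becomes the new "observed" data
z_sim = ppc.posterior_predictive["likelihood"].isel(chain=0, draw=0).values

# ...then rebuild the competing models with observed=z_sim, sample, call
# pm.compute_log_likelihood and az.compare exactly as before; repeating this
# for several draws shows how much the weights move around just from sampling.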
