Problem in model comparison

Regarding the weights, my understanding of the situation is: if you have two models that perform similarly, then one of the models is sufficient to explain the data. So in a situation where you are trying to explain your observed data by taking a weighted sum of your models, one of the models, albeit performing well, does not bring anything new to the table. Therefore you may just as well explain your data by giving all the weight to one of the models. The link I have given above and the articles referenced therein discuss this in more detail. I might also suggest this.

You can try a test where you sample the identical model twice and run model comparison (below, model1 and model3 are identical, while model2 has an extra unneeded parameter):


import pymc as pm
import numpy as np
import arviz as az

seed=0
rng = np.random.default_rng(seed)

x = 2*rng.random(100) - 1
y = 2*rng.random(100) - 1

a_real = 0.4
b_real = -2
c_real = 3
d_real = 0.01

z_real = a_real + b_real*x + c_real*y + d_real*x*y
z_obs = z_real + rng.normal(0, 0.1, size=z_real.shape)


with pm.Model() as model1:

  a = pm.Normal("a", 0, 1)
  b = pm.Normal("b", 0, 1)
  c = pm.Normal("c", 0, 1)
  sd = pm.HalfNormal("sd", 1)

  pm.Normal("likelihood", a + b*x + c*y, sd, observed=z_obs)

  idata1 = pm.sample(2000, tune=1000, chains=6, random_seed=rng)

with model1:
  pm.compute_log_likelihood(idata1)


with pm.Model() as model2:

  a = pm.Normal("a", 0, 1)
  b = pm.Normal("b", 0, 1)
  c = pm.Normal("c", 0, 1)
  d = pm.Normal("d", 0, 1)

  sd = pm.HalfNormal("sd", 1)

  pm.Normal("likelihood",  a + b*x + c*y + d*x*y, sd, observed=z_obs)

  idata2 = pm.sample(2000, tune=1000, chains=6, random_seed=rng)

with model2:
  pm.compute_log_likelihood(idata2)


with pm.Model() as model3:

  a = pm.Normal("a", 0, 1)
  b = pm.Normal("b", 0, 1)
  c = pm.Normal("c", 0, 1)
  sd = pm.HalfNormal("sd", 1)

  pm.Normal("likelihood", a + b*x + c*y, sd, observed=z_obs)

  idata3 = pm.sample(2000, tune=1000, chains=6, random_seed=rng)

with model3:
  pm.compute_log_likelihood(idata3)

comp = az.compare({"model3": idata3, "model2":idata2, "model1":idata1})
print(comp.to_string())


The result I get from this with seed=0 is:

        rank   elpd_loo     p_loo  elpd_diff        weight        se       dse  warning scale
model1     0  74.107390  4.373432   0.000000  1.000000e+00  8.389269  0.000000    False   log
model3     1  74.096669  4.373302   0.010720  3.330669e-16  8.376381  0.026016    False   log
model2     2  73.259166  5.240401   0.848224  0.000000e+00  8.218600  0.567958    False   log

You can see that model1 and model3, despite being identical and having very similar loo values, get weights 1 and 0 (I threw in a small interaction in the data, and a model with the interaction term, just for fun).
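This 1/0 split is what stacking does when two models make essentially the same predictions: the stacking objective is flat in the weight, so a corner solution like (1, 0) is just as good as an even split. Here is a toy sketch of that idea with made-up pointwise predictive densities (reusing the numpy import from above; this is only the idea, not ArviZ's actual implementation, which works with the LOO pointwise densities):

# toy illustration: two models with identical pointwise predictive densities
toy_rng = np.random.default_rng(0)
p1 = np.exp(toy_rng.normal(-1.0, 0.3, size=100))  # made-up densities p(y_i | model 1)
p2 = p1.copy()                                    # model 2 predicts exactly the same

# stacking maximizes sum_i log(w*p1_i + (1 - w)*p2_i) over the weight w in [0, 1]
for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(w, np.log(w*p1 + (1 - w)*p2).sum())
# the objective is identical for every w, so the weight is not identified
# and the optimizer is free to return a corner solution like (1, 0)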

Another run with seed=1 gives:

        rank   elpd_loo     p_loo  elpd_diff        weight        se       dse  warning scale
model3     0  94.797507  4.423737   0.000000  5.862351e-01  9.390746  0.000000    False   log
model2     1  94.718390  5.043274   0.079117  4.137649e-01  9.542755  0.936753    False   log
model1     2  94.688634  4.542713   0.108874  4.440892e-16  9.438960  0.056610    False   log

So in a singular situation like this, where there is a lot of uncertainty in the weights of the two models (because they perform similarly), the weights you get will depend on your particular sample and chains.
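If you want weights that reflect this uncertainty rather than collapsing to 0/1, one thing you can try (just an alternative weighting scheme offered by az.compare, not a fix for the underlying degeneracy) is pseudo-BMA with Bayesian bootstrap:

# alternative weighting: pseudo-BMA with Bayesian bootstrap instead of stacking;
# it tends to spread weight across models that perform about equally well
comp_bb = az.compare(
    {"model3": idata3, "model2": idata2, "model1": idata1},
    method="BB-pseudo-BMA",
)
print(comp_bb.to_string())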

As for the warnings, I am afraid I don’t know much about those. One possibility is not having enough data, though. If you run the code above with 10 data points (instead of 100) and seed=1, here is what I get:

        rank  elpd_loo     p_loo  elpd_diff        weight        se       dse  warning scale
model3     0  2.474919  3.038526   0.000000  1.000000e+00  1.445940  0.000000    False   log
model1     1  2.332299  3.152023   0.142620  1.665335e-16  1.437265  0.024676    False   log
model2     2  0.627785  4.384493   1.847134  0.000000e+00  1.762000  1.399859     True   log

So it gives a warning for model2, which has more parameters to estimate, so perhaps 10 data points is not enough for it. You can also see that with fewer data points (how many data points counts as few will depend on the number of parameters your model has) the difference in loo between identical models can be as high as 0.1. Running this with seed=2 gives a warning for all of the models above.
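If you want to see where a warning comes from, you can inspect the pointwise Pareto k diagnostics that LOO reports (reusing the objects from the code above; the 0.7 threshold below is the usual rule of thumb, your ArviZ version may use a slightly different cutoff):

# inspect the pointwise Pareto k diagnostics for the flagged model
loo2 = az.loo(idata2, pointwise=True)
print(loo2)                                      # summary including the k-value counts
print(loo2.pareto_k.values)                      # one k value per observation
print(np.where(loo2.pareto_k.values > 0.7)[0])   # indices of the problematic points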

ps: I do realize that these runs do not exactly reproduce the example you have seen, in which weights and loo have different orders (although conceptually they do show that weights do not always align with how good the model is). For that I can suggest using each of the models you have, one by one, as a data generator and running tests like the ones above on the simulated data. You can then estimate, for instance, the effect of sampling variability on the results you are observing, and so on.
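For example, to use the interaction model as the data generator you could take the posterior means of its parameters, simulate a fresh dataset, and rerun the same fits and az.compare on it. A minimal sketch of that idea, reusing the names from the code above (pm.sample_posterior_predictive would be another way to do the simulation):

# use model2 (the interaction model) as a data generator for a new comparison round
post = idata2.posterior
a_hat = post["a"].mean().item()
b_hat = post["b"].mean().item()
c_hat = post["c"].mean().item()
d_hat = post["d"].mean().item()
sd_hat = post["sd"].mean().item()

x_new = 2*rng.random(100) - 1
y_new = 2*rng.random(100) - 1
z_new = a_hat + b_hat*x_new + c_hat*y_new + d_hat*x_new*y_new
z_new_obs = z_new + rng.normal(0, sd_hat, size=z_new.shape)

# refit model1/model2/model3 on (x_new, y_new, z_new_obs) and call az.compare again,
# repeating over several simulated datasets to see how much the weights move around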
