Regarding the weights, my understanding of the situation is this: if you have two models that perform similarly, then one of them is sufficient to explain the data. So in a situation where you are trying to explain your observed data by a weighted sum of your models, one of the models, despite performing well, does not bring anything new to the table. Therefore you may just as well explain your data by giving all the weight to the other model. The link I have given above and the articles referenced therein discuss this in more detail. I might also suggest the following:
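As a toy illustration of why this happens (pure NumPy, with made-up predictive density values), if two models produce identical pointwise predictive densities, every split of weight between them yields exactly the same stacked log score, so the optimizer is free to return the degenerate 1/0 corner solution:

```python
import numpy as np

# Hypothetical pointwise predictive densities for 5 observations,
# evaluated under two "identical" models A and B.
dens_a = np.array([0.8, 0.5, 0.9, 0.4, 0.7])
dens_b = dens_a.copy()

def stacked_log_score(w):
    # Log score of the w:(1-w) mixture of the two predictive densities.
    return np.sum(np.log(w * dens_a + (1 - w) * dens_b))

# The score is completely flat in w, so an optimizer may just as well
# return the corner solution w = 1 (or w = 0).
scores = [stacked_log_score(w) for w in (0.0, 0.3, 0.5, 1.0)]
print(scores)  # all four values are identical
```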
You can try a test where you sample the identical model twice and run a model comparison (below, model1 and model3 are identical; model2 has an extra, unneeded parameter):
```python
import pymc as pm
import numpy as np
import arviz as az

seed = 0
rng = np.random.default_rng(seed)
x = 2 * rng.random(100) - 1
y = 2 * rng.random(100) - 1
a_real = 0.4
b_real = -2
c_real = 3
d_real = 0.01
z_real = a_real + b_real*x + c_real*y + d_real*x*y
z_obs = z_real + rng.normal(0, 0.1, size=z_real.shape)

with pm.Model() as model1:
    a = pm.Normal("a", 0, 1)
    b = pm.Normal("b", 0, 1)
    c = pm.Normal("c", 0, 1)
    sd = pm.HalfNormal("sd", 1)
    pm.Normal("likelihood", a + b*x + c*y, sd, observed=z_obs)
    idata1 = pm.sample(2000, tune=1000, chains=6, random_seed=rng)

with model1:
    pm.compute_log_likelihood(idata1)

with pm.Model() as model2:
    a = pm.Normal("a", 0, 1)
    b = pm.Normal("b", 0, 1)
    c = pm.Normal("c", 0, 1)
    d = pm.Normal("d", 0, 1)
    sd = pm.HalfNormal("sd", 1)
    pm.Normal("likelihood", a + b*x + c*y + d*x*y, sd, observed=z_obs)
    idata2 = pm.sample(2000, tune=1000, chains=6, random_seed=rng)

with model2:
    pm.compute_log_likelihood(idata2)

with pm.Model() as model3:
    a = pm.Normal("a", 0, 1)
    b = pm.Normal("b", 0, 1)
    c = pm.Normal("c", 0, 1)
    sd = pm.HalfNormal("sd", 1)
    pm.Normal("likelihood", a + b*x + c*y, sd, observed=z_obs)
    idata3 = pm.sample(2000, tune=1000, chains=6, random_seed=rng)

with model3:
    pm.compute_log_likelihood(idata3)

comp = az.compare({"model3": idata3, "model2": idata2, "model1": idata1})
print(comp.to_string())
```
The result I get with seed=0 is:
rank elpd_loo p_loo elpd_diff weight se dse warning scale
model1 0 74.107390 4.373432 0.000000 1.000000e+00 8.389269 0.000000 False log
model3 1 74.096669 4.373302 0.010720 3.330669e-16 8.376381 0.026016 False log
model2 2 73.259166 5.240401 0.848224 0.000000e+00 8.218600 0.567958 False log
You can see that model1 and model3, being identical and having very similar loo values, have weights 1 and 0. (I threw a small interaction into the data generation, and a model with the interaction term, just for fun.)
Another run with seed=1 gives:
rank elpd_loo p_loo elpd_diff weight se dse warning scale
model3 0 94.797507 4.423737 0.000000 5.862351e-01 9.390746 0.000000 False log
model2 1 94.718390 5.043274 0.079117 4.137649e-01 9.542755 0.936753 False log
model1 2 94.688634 4.542713 0.108874 4.440892e-16 9.438960 0.056610 False log
So in a singular situation like this, where there is a lot of uncertainty about the weights of the two models (because they perform similarly), the weights you get will depend on your sample and chains.
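For intuition on why the weights can flip around like this: pseudo-BMA weights are essentially a softmax of the elpd values (az.compare exposes this via method="pseudo-BMA" or "BB-pseudo-BMA" instead of the default stacking), so nearly tied models get nearly equal weight, while stacking is free to put everything on one of them. A quick sketch plugging in the elpd_loo values from the seed=1 table above:

```python
import numpy as np

# elpd_loo values from the seed=1 run above (model3, model2, model1).
elpd = np.array([94.797507, 94.718390, 94.688634])

# Plain pseudo-BMA weights are a softmax of the elpd values
# (the BB variant adds a Bayesian-bootstrap correction on top).
w = np.exp(elpd - elpd.max())
w /= w.sum()
print(w)  # three nearly equal weights, unlike the 0/1 stacking weights
```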
As for the warnings, I am afraid I don’t know much about those. One possibility is not enough data, though. If you run the code above with 10 data points (instead of 100) and seed=1, here is what I get:
rank elpd_loo p_loo elpd_diff weight se dse warning scale
model3 0 2.474919 3.038526 0.000000 1.000000e+00 1.445940 0.000000 False log
model1 1 2.332299 3.152023 0.142620 1.665335e-16 1.437265 0.024676 False log
model2 2 0.627785 4.384493 1.847134 0.000000e+00 1.762000 1.399859 True log
So it gives a warning for model2, which has more terms to estimate, so perhaps 10 data points is not enough for it. You can also see that with fewer data points (how many counts as few will depend on the number of parameters your model has), the difference in loo between identical models can be as high as 0.1. Running this with seed=2 gives a warning for all three models.
PS: I do realize that these runs do not exactly reproduce the example you have seen, in which weights and loo have different orders (although conceptually they do show that weights do not always align with how good a model is). For that, I can suggest using each of your models, one by one, as the data generator and running tests like the ones here on the simulated data. You can then estimate, for instance, the effect of sampling variability on the results you are observing.