Comparing LOO-scores from a fit via the GLM module with other models

Thanks to @junpenglao, I was now able to use pymc3’s glm module to fit a logistic regression to my data (see original post quoted below) and obtain a LOO score.

Now I’m facing another problem. I want to compare that LOO score with those obtained from two other models. These other models are sigmoids (psychometric functions) that model the success probability for the Binomially distributed data at different levels.

These are plots of the data (points) and posterior predictive samples (line shows mean, shaded bands are 95 % HDIs):

(Sorry for the small font; you may need to zoom in.)

Now the LOO score (see plots) for the logistic regression via the GLM module is unreasonably low compared to the other models. I’m not sure why this happens.

Could this happen because the other two models use Binomial likelihoods of the aggregated counts, while the GLM module’s logistic regression is based on the individual binary responses?
And if so, is there a way to correct for this to make the models comparable?

I thought that because these models are ultimately fitted on the same data, their LOOs would be comparable… But perhaps the structure of the data (whether it enters Binomial nodes in aggregated form or as individual Bernoulli samples) plays a role?
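To make the difference concrete, here is a small numpy/scipy sketch (with made-up values for `p` and the counts, not my actual data) showing that the two likelihood formulations agree in total only up to the log binomial coefficient, while the pointwise decomposition is different:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import bernoulli, binom

p = 0.7        # hypothetical success probability
n, k = 20, 14  # e.g. 14 successes out of 20 tosses at one level

# Aggregated view: ONE Binomial observation -> one pointwise log-lik term
ll_binom = binom.logpmf(k, n, p)

# Disaggregated view: TWENTY Bernoulli observations -> 20 pointwise terms
tosses = np.array([1] * k + [0] * (n - k))
ll_bern = bernoulli.logpmf(tosses, p).sum()

# The totals differ only by the log binomial coefficient log C(n, k),
# but LOO sees 1 pointwise term in one model and 20 in the other.
log_choose = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
assert np.isclose(ll_binom, ll_bern + log_choose)
```

So the total log-likelihoods are almost interchangeable, but the pointwise terms that LOO sums over are not the same quantities.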

Any help would be highly appreciated!

Original post (just in case someone came here to look for how to use the GLM module to fit multiple slopes/intercepts):


Any ideas? Even vague hints on how to look into this would be highly appreciated!

If the data used is the same, the estimated ELPD values (or LOO scores) should be comparable, and if the models have similar predictive power, the values should be similar.

Maybe there is some data transformation, like a scaling? In that case, the pointwise log-likelihood values have to be corrected to account for the Jacobian of the transformation in order to be comparable. This blog post I wrote a while back shows exactly how to do this.
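For a continuous transformation, the correction is just the log absolute Jacobian added to each pointwise log-likelihood. A minimal sketch (not the blog post’s code; assuming a toy model fit to z = log(y), compared against a model on the y scale):

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=50)

# Pointwise log-likelihood of z = log(y) under a Normal(0, 1) model
z = np.log(y)
loglik_z = -0.5 * np.log(2 * np.pi) - 0.5 * z**2

# Corrected to the y scale: log p(y) = log p(z) + log|dz/dy|,
# where dz/dy = 1/y, so the correction term is -log(y)
loglik_y = loglik_z - np.log(y)

# Sanity check: this matches the lognormal density evaluated directly on y
assert np.allclose(loglik_y, lognorm.logpdf(y, s=1.0))
```

Only after this correction are the pointwise values of the two models on the same scale and usable in the same LOO comparison.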


Thanks for the reply! I will look into whether there are any transformations in the GLM module…

My latest guess was that it might have something to do with the number of observations? The “raw” binary observations are of course the same. But in the models with the higher scores, 20 binary observations enter as one “X out of 20” (at every level). In the LOO/WAIC paper I see sums running over all observations in the calculations (which I do not fully understand…). I tried the naive thing and “normalized” the score by multiplying by 20… but the numbers don’t look right either.

Edit 1: in the GLM version there is the logit link function, which I guess is a transformation… So would that distort the ICs?

Edit 2: hmm, it seems the scores are already normalized via the weights, so a different number of observations is probably not the problem…
I get the idea in your blog post, but I’m not sure yet how I need to transform the likelihood to undo the logit. I guess it needs to be done for each level independently? Any ideas are welcome!


I’m not sure what you mean by a different number of observations. A different number of observations suggests you are quantifying different tasks, in which case you can’t compare the ICs, as the absolute scale of an IC has no meaning. This is probably a bit cryptic, but I think this notebook will clarify what I mean.



Concerning the observations, the situation is analogous to this:

Say I have 10 trials, each with 5 coin tosses. One model could implement this as 10 Binomial “x out of 5” observations… Another model might model this with 50 binary Bernoulli observations.
(All tosses are independent with the same success probability.)

This was what I first thought might lead to the discrepancy. But looking at the equations, it does not necessarily have to be a problem if the importance weights for a Bernoulli likelihood are smaller than the Binomial ones…


From what you are explaining, it looks like you are performing leave one trial out when using binomial observations and leave one toss out when using Bernoulli observations.


Exactly, that’s at least what I think is going on. I have not entirely understood the internals of the GLM module yet, but the posterior predictive it produces is toss-based (in the code, the GLM uses a Binomial likelihood with N fixed to 1).

So the question is whether leave-one-trial-out and leave-one-toss-out scores can be directly compared (or how they can be adjusted). My first suspicion was that they might need adjustment. But then again, the scores might already take the informational value of a sample into account.

It just occurred to me that I can address this question empirically by fitting the simplified example I gave above… Will report back.

They cannot be directly compared; this is what the second notebook I shared is about. The situation is similar to “predicting the outcome of a match”, which would be like leave-one-trial-out, and “predicting the goals scored per match and per team”, which would be like leave-one-toss-out. In the example, however, we have a single model and we compute LOO for different predictive tasks; in your case, you have two models, and you have to make sure you are asking the same question of both of them if you want to be able to compare the answers.

With the Binomial you have leave-one-trial-out, and you can’t get leave-one-toss-out even if you wanted to. To get leave-one-trial-out from the Bernoulli case, you have to group/aggregate the observations like it’s done in the example notebook, because there, like in the example, you can get multiple predictive tasks from the same fit.
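As a sketch of the grouping idea on your coin-toss example (not the notebook’s exact code; `p_draws` here is a made-up stand-in for a posterior trace), summing the per-toss log-likelihood columns within each trial turns the 50-column leave-one-toss-out task into the 10-column leave-one-trial-out task:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import bernoulli

rng = np.random.default_rng(1)
n_trials, n_tosses = 10, 5  # 10 trials of 5 tosses each
tosses = rng.binomial(1, 0.6, size=(n_trials, n_tosses))

# Stand-in for posterior draws of the success probability (conjugate Beta)
p_draws = rng.beta(1 + tosses.sum(), 1 + tosses.size - tosses.sum(), size=2000)

# Pointwise log-likelihood array: draws x trials x tosses
ll = bernoulli.logpmf(tosses[None, :, :], p_draws[:, None, None])

# Leave-one-toss-out: 50 pointwise columns
ll_toss = ll.reshape(len(p_draws), -1)

# Leave-one-trial-out: sum the 5 tosses within each trial -> 10 columns
ll_trial = ll.sum(axis=2)

# Log pointwise predictive density for each task (PSIS-LOO would also
# smooth the importance weights; omitted here for brevity)
lpd_toss = logsumexp(ll_toss, axis=0) - np.log(len(p_draws))
lpd_trial = logsumexp(ll_trial, axis=0) - np.log(len(p_draws))
# lpd_toss.sum() and lpd_trial.sum() answer different predictive
# questions, so their totals are not comparable
```

The grouping happens before the log-sum-exp over draws, which is why you cannot recover it by rescaling the finished scores afterwards.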


I have not yet had the time to fully work through that notebook (will hopefully soon).

I was just wondering if I couldn’t just do this:
Add a Bernoulli distribution to the Binomial model that uses the estimated success rate to generate tosses in the same form and number as the binary model does, and then use the posterior predictive distribution of that.
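Mechanically, something like this is what I have in mind (a sketch with made-up posterior draws, not the actual model; `p_draws` would come from the Binomial model’s trace):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws of the success probability at one level,
# as the Binomial model would provide them
p_draws = rng.beta(30, 20, size=1000)

# Generate toss-level posterior predictive samples in the same shape as
# the binary model's data: one row per draw, 20 tosses at this level
pp_tosses = rng.binomial(1, p_draws[:, None], size=(1000, 20))
```

That would give toss-shaped predictive samples, though I’m not sure it changes the pointwise log-likelihoods that LOO actually uses.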