Help with computing Bayes Factors


Fairly new to using a specific probabilistic programming language.
What I am looking to understand is how I can generate a Bayes factor for two models on a dataset.
I have built a linear model and a quadratic model for fake noisy quadratic data, and would like to compute P(Model | Y) for each.
Some assistance to better understand this would be appreciated.
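For reference, the sort of fake noisy quadratic data I mean can be generated like this (the coefficients, noise level, and seed are just arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# True quadratic relationship with additive Gaussian noise.
# Coefficients and noise scale are placeholders, not from any real dataset.
x = np.linspace(-2.0, 2.0, 100)
y_true = 1.0 + 2.0 * x + 3.0 * x**2
y = y_true + rng.normal(0.0, 1.0, size=x.shape)
```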


That notebook you linked to is a bit odd (I'd never really looked at it closely before). I might suggest checking out this notebook I wrote instead. See if that helps a bit more. If not, let me know and we can find you something that does!

So this is only a proposal at the moment, but I’m planning on creating an example notebook covering hypothesis testing and Bayes factors: Bayesian hypothesis testing using the Savage–Dickey method · Issue #349 · pymc-devs/pymc-examples · GitHub


Hey and thanks for your response.
I would love to see this when you have it done. And you've hit the nail on the head: hypothesis testing is exactly what I am attempting to do (although comparative testing at the moment), and I am trying to better understand what that looks like through the probabilistic-programming lens. I also discovered WAIC and LOO in the process of looking, but the Bayes factor feels more intuitive to me!
As a rambling aside, one of the challenges I am having in this space at the moment is that while it feels ‘easy’ to build and compare some of these models, there is a lot of weed bashing one seems to have to do, knowledge- and skills-wise, before we are smooth sailing on that lake :slight_smile:

Thanks for your response cluhmann!
That notebook kind of helps, but it also kind of raises more questions:

  • Why is the posterior the mean of the trace?
  • Why is P(M2|Y) simply 1 − P(M1|Y)? Surely there are other potential models, so calculating a ratio based only on the models a coder cared to look at feels off… and I would guess that if one compared M2 against M1 rather than M1 against M2, the answer would also be different.
  • Definitely others that are not at the top of my head at the moment.

In a similar vein to my aside in the previous post to Ben, I am finding myself easily lost in the weeds in an area I have some professional competency in, and I don't feel it should be this way. I would love to work with you to help find / create something that is more obvious to me (and hopefully to others too)!

That implementation uses a categorical parameter (pm1). During sampling, this parameter hops back and forth between the two states (state 0 corresponding to model 1, state 1 corresponding to model 2). When pm1 is in state 0, the model “becomes” model 1; when pm1 is in state 1, the model “becomes” model 2:

theta = pm.math.switch(pm.math.eq(pm1, 0), theta_0, theta_1)  # theta_0 if pm1 == 0, else theta_1

The posterior of pm1 then indicates how much time pm1 spent in state 0 (corresponding to model 1). So the fraction of sampled states equal to 0 captures the degree of belief assigned to model 1.
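In other words, once sampling is done you can pull the indicator's draws out of the trace and the posterior probability of model 1 is just the fraction of draws spent in state 0. A toy numpy illustration (the draws below are made up by hand, not produced by an actual sampler):

```python
import numpy as np

# Pretend these are posterior draws of the indicator pm1
# (0 = model 1, 1 = model 2). In practice you would extract
# them from the trace rather than writing them out by hand.
pm1_draws = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

p_m1 = (pm1_draws == 0).mean()  # fraction of draws in state 0 -> P(M1 | Y)
p_m2 = 1.0 - p_m1               # fraction of draws in state 1 -> P(M2 | Y)
print(p_m1)  # 0.7
```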

Because a Bayes factor is a ratio: it provides a relative comparison of exactly two models. Evidence for one model is evidence against the other, and vice versa. Just because one model is substantially better than another does not mean it is good in any absolute sense.
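If you do want to turn posterior model probabilities into a Bayes factor, the standard relationship is posterior odds = Bayes factor × prior odds, so the Bayes factor is the posterior odds divided by the prior odds. With equal prior model probabilities (a 50/50 assumption on my part here), the Bayes factor is just the posterior odds:

```python
# Posterior model probabilities (made-up numbers for illustration)
p_m1 = 0.7
p_m2 = 1.0 - p_m1

# Prior model probabilities; the 50/50 split is an assumption.
prior_m1, prior_m2 = 0.5, 0.5

posterior_odds = p_m1 / p_m2
prior_odds = prior_m1 / prior_m2
bf_12 = posterior_odds / prior_odds
print(bf_12)  # ~2.33 in favor of model 1
```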

Regarding the lack of intuition, I think it’s appropriate when it comes to Bayes factors. I don’t find them particularly intuitive myself, and I tend to discourage their use (see the rant in my notebook). They are sort-of-kind-of like standard frequentist tests (e.g., likelihood ratio tests), but they aren’t. They are Bayesian, but not as Bayesian as many Bayesians would prefer (e.g., Bayes factors ignore priors over models, and a thoroughgoing Bayesian would never suggest throwing out a model, but would average over models instead). And implementations tend to be difficult to sample from (thus the need for the sort of approach taken in Ben’s notebook). So yeah, it’s all a bit weird.

WAIC and LOO are good alternatives. I would check out this paper if you haven’t already.
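For what it's worth, the core WAIC computation is small enough to sketch directly. In practice you would let ArviZ compute it from the fitted model's pointwise log-likelihoods, but the formula is just the log pointwise predictive density minus a variance-based effective-parameter penalty. A numpy sketch with fake log-likelihood draws (shapes and numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake pointwise log-likelihoods with shape (n_draws, n_observations).
# In a real analysis these come from the model's log-likelihood values
# evaluated at each posterior draw for each observation.
log_lik = rng.normal(-1.0, 0.1, size=(1000, 50))

# lppd: log of the posterior-mean likelihood, summed over observations
lppd = np.log(np.exp(log_lik).mean(axis=0)).sum()

# Effective number of parameters: per-observation variance of the
# log-likelihood across draws, summed over observations
p_waic = log_lik.var(axis=0, ddof=1).sum()

# WAIC on the deviance scale (-2 * expected log predictive density)
waic = -2.0 * (lppd - p_waic)
```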

I missed that there are essentially three models that you have defined!

  • theta_0: a model of interest
  • theta_1: a competitor model of interest
  • pm1: a “meta” model that’s used to evaluate the preference for one model over the other in terms of how well each explains the data

I will continue to pull apart and digest your notebook! And thanks for the paper link too :slight_smile: I have the feeling that (like with all data science) building a battery of model-comparison, or hypothesis-comparison, tooling is essential to help build intuition both for the underlying problem and for the models we use to represent it

Yes! Agree with all of that.

FYI, for anyone else reading this: you should not use the implementation found in that (my) notebook. It is intended for illustrative, pedagogical purposes only. I should add a disclaimer (and update it to PyMC v4).

This is perhaps a naive question, but my only systematic exposure to likelihood ratios was this lecture on likelihood ratio processes. My impression was that the test itself could be formulated as a random process over, e.g., the entire set of posterior draws of two alternative models. Granted, it’s typically only done at the posterior mode, but would it be incorrect to compute the limiting value of the process taking the entire posterior into account? If it’s not incorrect, does anyone do this? Is there a link between likelihood ratio processes and Bayes factors that I’m not seeing?

These sorts of hybrid tests are of interest to me because I work in a field well and truly dominated by frequentists, so it’s nice to have bridges that allow communication with peers who aren’t totally bought into Bayes.

So, as an attempt at an answer: I don't think we can do what you are suggesting. The objective here isn't to sample from the posterior, because we aren't particularly concerned with those results. What we are interested in, from a hypothesis perspective, is which model looks preferable given the data.

That said, there are reasons to sample the posterior, as we can check it against unseen data as another layer of testing.

From a frequentist perspective, it feels like what you are trying to do is define distributions over the posterior and over the data, and then apply some test of statistical difference, but I think that either doesn't quite fit here or is circuitous.

But this is just a stab at an answer to your question :slight_smile:


Looking over that page, I don’t see anything obvious that would dispute your interpretation. I have no idea whether people do this (or think about things in this way), though I might check out the sequential testing literature. But that’s just a guess.


Sure, inference is hard! It’s an ongoing journey of learning and gaining familiarity.

Hello again :slight_smile:

I'm working through your notebook, and I tried to apply your ‘meta model’ in an experiment of my own. Where I got confused was with the meta model being Bernoulli (which makes sense because we have two competing models), but how do we continue this line of thinking when the data is not binary?

The example I have been trying to work through is one where I have two regression models trying to fit a line of best fit to some data. I would imagine the data I feed to the meta model would be the noisy observation points, but as stated above, I don't know how to set this up.

Any advice would be very welcome :slight_smile:


If you are comparing nested regression models, then I would imagine you would build the version that includes the quadratic term and then use your “indicator” to flip between the actual quadratic coefficient and zero (reducing the quadratic model to its special case, the linear model). But I would definitely recommend getting the standard quadratic regression model working before trying the model comparison. This notebook should help.
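To sketch the "flip the quadratic coefficient to zero" idea concretely (plain numpy rather than a full PyMC model, and the function and variable names are mine):

```python
import numpy as np

def regression_mean(x, b0, b1, b2, use_quadratic):
    """Mean function of the nested model. With use_quadratic == 0 the
    quadratic term is switched off, recovering the linear special case;
    this mirrors what pm.math.switch on an indicator would do inside a
    PyMC model."""
    b2_eff = b2 if use_quadratic else 0.0
    return b0 + b1 * x + b2_eff * x**2

x = np.linspace(-1.0, 1.0, 5)
full = regression_mean(x, 1.0, 2.0, 3.0, use_quadratic=1)     # quadratic
reduced = regression_mean(x, 1.0, 2.0, 3.0, use_quadratic=0)  # linear
linear = 1.0 + 2.0 * x
assert np.allclose(reduced, linear)  # the reduced model is exactly linear
```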