Fairly new to using a specific probabilistic programming language.
What I am looking to understand is how I can generate a Bayes Factor (https://www.pymc.io/projects/examples/en/latest/diagnostics_and_criticism/Bayes_factor.html) for two models on a dataset
I have built a linear and a quadratic model to model fake noisy quadratic data, and would like to measure P(Model | Y ) for each.
Some assistance to better understand this would be appreciated

That notebook you linked to is a bit odd (Iāve never quite noticed it before). I might suggest checking out this notebook I wrote. See if that helps a bit more. If not, let me know and we can find you something that does!

Hey and thanks for your response.
I would love to see this when you have it done. And youve hit the nail on the head - exactly what I am attempting to do is hypothesis testing (although comparative testing atm) and trying to better understand what that looks like through the prob.prog. lens. I also discovered WAIC and LOO in the process of looking, but BF feels intuitive to me!
As a rambling aside, one of the challenges I am having in this space at the moment is that while it feels āeasyā to build and compare some of these models, there is a lot of weed bashing that one seems to have to do re knowledge or skills before we are smooth sailing on that lake

Thanks for your response cluhmann!
That notebook kind of helps, but it also kind of raises more questions:

Why is the posterior the mean of the trace?

Why is P(M2|Y) simply 1 - P(M1|Y) (surely there are other potential models, so calculating a ratio based on what a coder cared to look at feels offā¦ and I would guess that if one did M2 then M1, the answer would also be different)

Definitely others that are not on the top of my head atm
In a similar vein to my aside in the previous post to Ben, I am finding myself easily lost in the weeds in an area I have some professional competency in, and I dont feel it should be this way. I would love to work with you to help find / create something that more obvious to me (and hopefully others too)!

That implementation uses a categorical parameter (pm1). During sampling, this parameter hops back and forth between the 2 states (state 0 corresponding to model 1, state 1 corresponding to model 2). When pm1 is in state 0, the model ābecomesā model 1, when pm1 is in state 1, the model ābecomesā model 1:

The posterior of pm1 then indicates how much time pm1 spend in the state 0 (corresponding to model 1). So the mean of all the states visited captures the degree of belief assigned to model 1.

Because a Bayes factor is a ratio that provides a relative model comparison of two models. Evidence for one model is evidence against the other and vice versa. Just because one model is substantially better than another does not suggest it is a good in some universal sense.

Regarding the lack of intuition, I think itās appropriate when it comes to Bayes factors. I myself donāt find them particularly intuitive, I tend to discourage their use (see the rant in my notebook). They are sort-of-kind-of like standard frequentist tests (e.g., likelihood ratios), but they arenāt They are Bayesian, but not as Bayesian as many Bayesian would prefer (e.g., Bayes factors ignore priors over models and Bayesians would never suggest throwing out a model, but would average them instead). And implementations tend to be difficult to sample (thus the need for the sort of approach taken in Benās notebook). So yeah, itās all weird.

WAIC and LOO are good alternatives. I would check out this paper if you havenāt already.

Aha!
I missed that there are essentiall 3 models that you have defined!

theta0 a model of interest

theta1 a competitor model of interest

pm1 a āmetaā model thats used to evaluate the preference of one model over another for data explainability
I will continue to pull apart and digest your notebook! And thanks for the paper link too I have the feeling (like with all data science) building a battery of model comparison, or hypothesis comparison tooling is essential to help dig at both intuition for the underlying problem, but also at the models we use to represent them

FYI, for anyone else reading this, you should not use the implementation found in that (my) notebook. It is intended for illustrative, pedagogical purposes. I should add a disclaimer (and update it to PyMC v4).

This is perhaps a naive question, but my only systematic exposure to likelihood ratios was from this lecture on likelihood ratio processes. My impression was that the test could be formulated itself as a random process over, e.g., the entire posterior draws of two alternative models. Granted that itās typically itās only done at the posterior mode, but would it be incorrect to compute the limiting value of the process taking into account the entire posterior? If itās not incorrect, does anyone do this? Is there a link between lr processes and bayes factor that Iām not seeing?

These sorts of hybrid tests are of interest to me because I work in a field well and truly dominated by frequentists, so itās nice to have bridges that allow communication with peers who arenāt totally bought into Bayes.

So, as an attempt for an answer, I dont think we can do what you are suggesting. The objective here isnt to sample from the posterior because we arent particularly concerned with those results. What we are interested in from a hypothesis perspective is given the data which model looks preferable.

That said, there are reasons to sample the posterior as we can check it against unseen data to see if there are similarities as another layer of testing.

From a frequentist perspective, it feels like what you are trying to do is define a distributions over the posterior, and over the data and then grab some form of test for measures of statistical differences but I either dont think that quite fits here, or it is circuitous.

But this is just a stab at an answer to your question

Looking over that page, I donāt see anything obvious that would dispute your interpretation. I have no idea if people do this (or think it about things in this way), though I might check out the sequential testing literature. But thatās just a guess.

Im working through your notebook, and I tried to understand your āmeta modelā in an experiment of my own. Where I got confused was with the meta model being Bernoulli (that makes sense because we have competing models), but how do we continue this line of thinking when the data is not binary.

The example that I have been trying to work through is where I have two regression models trying to fit a line of best fit to some data. I would imagine the data I feed to the meta model would be the noisy observation points, but as stated above, I dont know how to do this.

If you are comparing nested regression models, then I would imagine that you would build the version that includes the quadratic term and then use your āindicatorā to flip between the actual quadratic coefficient and zero (reducing the quadratic to the special case, linear model). But I would definitely recommend starting with the standard, quadratic regression model before trying to do the model comparison. This notebook should help.