Suggestions on comparing Bayesian models to classical machine learning models

I am working on a problem where I want to compare a Bayesian model to an ML model. I was wondering if anyone has suggestions on which approaches would be fair for both methods. I have compared Bayesian models to one another using, but never a Bayesian model to ML models.

For example, I want to compare a hierarchical linear model to BART to an xgboost model. Is classic k-fold CV a good approach, or are there other methods for OOS prediction that are preferable? Also, which definition of “error” is valid for both approaches while also being computationally feasible?


Would you want to compare point estimates from Bayesian models (ie, take the MAP or expected value of the posterior) to point estimate from an ML model? CV would be the way to go. What you’re looking at with is the result of an approximation of CV that’s not possible to derive for a generic ML model.

A main strength of Bayesian models is that they give full posteriors, whereas uncertainty from ML models is usually tacked on or has to be calibrated, maybe with an additional Bayesian model, so keep that in mind when comparing point estimates! I think you can use whatever definition of error is most relevant to your context.


Thanks for the reply!

So what it comes down to is that our team is slightly split between a traditional ML approach and a Bayesian modeling approach (the direction I lean). The Bayesian approach is a clear winner when it comes to uncertainty, however, we have a lot of data so using MCMC is significantly slower… (I am exploring using VI as an in between but I am not quite there yet).

For example, one of the debates we are having is for some features, we model them as a level within a hierarchical model. However, for the ML approach we can only treat them as an additional predictor.

What you’re looking at with is the result of an approximation of CV that’s not possible to derive for a generic ML model.

I have loved using for comparing Bayesian models of different complexity, centered vs noncentered, etc., but I do understand it is not possible for comparing to ML.

Would you want to compare point estimates from Bayesian models (ie, take the MAP or expected value of the posterior) to point estimate from an ML model? CV would be the way to go.

While the uncertainty is important, in order to justify my claim I think I need some measure of OOS prediction accuracy… so I guess to make the comparison fair it would have to be a point estimate comparison… so maybe something like R^2 or MSE and k-fold CV? Is it fair to compare something like a hierarchical Bayes model to an ML model of slightly different structure?
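A minimal sketch of the point-estimate comparison discussed here, assuming nothing beyond NumPy: every model is scored with RMSE on the same k folds. The conjugate Gaussian linear model (with assumed `prior_sd` and `noise_sd`) stands in for a Bayesian model's posterior-mean prediction, and a small nearest-neighbour regressor stands in for the ML model. Both are illustrative placeholders, not the hierarchical model, BART, or xgboost from the question, and all function names are hypothetical.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 once and split into k roughly equal test folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def bayes_linear_predict(X_tr, y_tr, X_te, prior_sd=10.0, noise_sd=1.0):
    """Posterior-mean prediction of a conjugate Gaussian linear model.
    Assumes weights ~ Normal(0, prior_sd^2) and a known noise_sd."""
    precision = X_tr.T @ X_tr / noise_sd**2 + np.eye(X_tr.shape[1]) / prior_sd**2
    w_mean = np.linalg.solve(precision, X_tr.T @ y_tr / noise_sd**2)
    return X_te @ w_mean

def knn_predict(X_tr, y_tr, X_te, n_neighbors=5):
    """Stand-in ML model: mean target of the nearest training points."""
    dist = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :n_neighbors]
    return y_tr[nearest].mean(axis=1)

def cv_rmse(predict_fn, X, y, folds):
    """Out-of-sample RMSE per fold; pass the same folds to every model."""
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        pred = predict_fn(X[train_mask], y[train_mask], X[test_idx])
        scores.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    return np.array(scores)

# Simulated data for illustration only
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.0, size=200)

folds = kfold_indices(len(y), k=5)
rmse_bayes = cv_rmse(bayes_linear_predict, X, y, folds)
rmse_knn = cv_rmse(knn_predict, X, y, folds)
print("Bayesian posterior mean RMSE:", rmse_bayes.mean())
print("kNN RMSE:", rmse_knn.mean())
```

Reusing one set of folds keeps the comparison paired, so the per-fold differences can be inspected rather than only the averages.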


So what it comes down to is that our team is slightly split between a traditional ML approach and a Bayesian modeling approach (the direction I lean)

Been there :sweat_smile: But why not both?

For example, one of the debates we are having is for some features, we model them as a level within a hierarchical model. However, for the ML approach we can only treat them as an additional predictor.

This is a great point. This situation is handled particularly well by a Bayesian model. For example, say your training data comes from a study done at 5 different hospitals, and using hospital_id as a predictor is very helpful because they did things a bit differently at each hospital. Maybe the ML model will do really well in CV using data from those 5 hospitals, but in the future how do you predict for a hospital that’s not one of these five? The only way is to remove that feature, or maybe do something really hacky. A hierarchical Bayesian model is perfect for handling this very common case.
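As a rough sketch of why the hierarchical model copes with an unseen hospital: for a varying-intercept model, a new hospital's intercept can be drawn from the population-level distribution, one draw per posterior sample. The arrays below are simulated stand-ins for posterior draws (the names `mu_alpha`, `sigma_alpha`, `sigma_obs`, and `alpha_known` are hypothetical); in practice they would come from the fitted model's trace.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 4000

# Simulated stand-ins for posterior draws (illustration only)
mu_alpha = rng.normal(50.0, 1.0, size=n_draws)             # population mean of hospital intercepts
sigma_alpha = np.abs(rng.normal(5.0, 0.5, size=n_draws))   # between-hospital standard deviation
sigma_obs = np.abs(rng.normal(2.0, 0.2, size=n_draws))     # observation noise
alpha_known = rng.normal(53.0, 0.8, size=n_draws)          # intercept of a hospital seen in training

# Unseen hospital: draw its intercept from the population distribution,
# pairing each draw with the same posterior sample of mu_alpha and sigma_alpha.
alpha_new = rng.normal(mu_alpha, sigma_alpha)
y_new = rng.normal(alpha_new, sigma_obs)

# Seen hospital: use its own estimated intercept.
y_known = rng.normal(alpha_known, sigma_obs)

print("predictive sd, new hospital:  ", y_new.std())
print("predictive sd, known hospital:", y_known.std())
```

The prediction for the new hospital is centered on the population mean and is wider, because it carries the between-hospital variation as extra uncertainty.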

I guess overall I’m of the opinion that if you don’t really know the data generation process and you have a lot of data, ML models will often perform better on various metrics (though you’re maybe overfitting a little) because they can adapt to all sorts of non-linear hypothesis spaces without you having to understand much of it. And if you don’t need uncertainty, that’s a great place to use them. But if you do have a good handle on the data generation process and you can represent that structure in a model, then your Bayesian model will probably win. Also, I think Bayesian models can be more useful because of the explainability aspect. In my experience, a lot of “why” questions come after forecasts or predictions are made, and that’s where Bayesian methods really shine.

But as far as your actual question about specific metrics… not sure! Coverage is another one you could consider, ie, is the true value within the 80% posterior interval 80% of the time?
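The coverage check could be sketched as follows, assuming you have an array of predictive draws with shape `(n_draws, n_obs)` for held-out observations. This version uses the central (equal-tailed) interval rather than an HDI, and the data are simulated purely for illustration; `interval_coverage` is a hypothetical helper name.

```python
import numpy as np

def interval_coverage(pred_draws, y_true, prob=0.8):
    """Fraction of observations whose true value falls inside the central
    `prob` interval of their predictive draws.
    pred_draws has shape (n_draws, n_obs); y_true has shape (n_obs,)."""
    tail = (1.0 - prob) / 2.0
    lower = np.quantile(pred_draws, tail, axis=0)
    upper = np.quantile(pred_draws, 1.0 - tail, axis=0)
    return np.mean((y_true >= lower) & (y_true <= upper))

# Simulated example: true values are mu plus unit-variance noise
rng = np.random.default_rng(0)
n_obs, n_draws = 1000, 1000
mu = rng.normal(size=n_obs)
y_true = mu + rng.normal(size=n_obs)

# Well-calibrated predictive draws use the correct noise scale
calibrated_draws = mu + rng.normal(size=(n_draws, n_obs))
# Overconfident draws use half the true noise scale
overconfident_draws = mu + 0.5 * rng.normal(size=(n_draws, n_obs))

coverage = interval_coverage(calibrated_draws, y_true, prob=0.8)
overconfident = interval_coverage(overconfident_draws, y_true, prob=0.8)
print("calibrated coverage:   ", coverage)
print("overconfident coverage:", overconfident)
```

Coverage well below the nominal level suggests intervals that are too narrow; coverage well above it suggests intervals that are too wide.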

Been there :sweat_smile: But why not both?

Totally agree and we definitely use both. Like you said, ML often wins when we have lots of data and don’t worry about some of the points you made about levels/predictors.

This is a great point. This situation is handled particularly well by a Bayesian model. For example, say your training data comes from a study done at 5 different hospitals, and using hospital_id as a predictor is very helpful because they did things a bit differently at each hospital. Maybe the ML model will do really well in CV using data from those 5 hospitals, but in the future how do you predict for a hospital that’s not one of these five? The only way is to remove that feature, or maybe do something really hacky. A hierarchical Bayesian model is perfect for handling this very common case.

:point_up: Love this. Great explanation that really illustrates the point.

Thanks for all your points. I think this does help clarify when/why one is more suitable. Also the coverage suggestion seems really helpful.


Apologies for jumping in, but I would love to know more about the explainability aspect of Bayesian models.

Specifically, how do Bayesian models offer more explainability in terms of model predictions, features, and their causality (as compared to other ML models)?
Moreover, how can we relate the quantification of uncertainty (offered by Bayesian models) to model explainability?

I am also working on a little project where I have to justify my use of Bayesian regression models in terms of model and feature explainability. So any help or guidance in this regard would be extremely appreciated.


Any help would be greatly appreciated!

I would also like to know more about the explainability of pymc models - especially understanding each feature’s role in the final prediction. I expect the explainability is in there but I’m having trouble finding the right methods to show it.

I think what he meant by saying that “Bayesian models can be more useful because of the explainability aspect” was more of a reference to the nature of Bayesian statistics. We need to understand the data and expected patterns at least a little bit before choosing a prior distribution. Even if we begin with a bad initial estimate, there is logic that can be followed to see how the model went from our initial guess to the posterior estimate. With an ML model, we don’t need to initialize the model with any kind of prior estimate, and thus we don’t necessarily know where the model begins its estimation nor how it gets to its final learned state.

Correct me if I’m wrong, but Bayesian statistics requires a bit more work on the user end than ML methods because we need to summarize the data and have a guess for how it is distributed.

In my original question about ML vs Bayesian models, I was most concerned with cases where we could use a hierarchical structure in the data. In classical statistical parlance, that would be cases where we might consider a mixed effects model over a fixed effects model. Personally, I essentially default to using Bayesian models whenever possible and ML models otherwise. There are cases where I only care about prediction and will just use ML, but most of my work revolves around models where understanding is required.

In my work, I often have cases where a variable can be used to index within the hierarchical structure. I care about explainability mainly in the form of the model structure as well as the feature weights. For example, consider the classic penguins dataset. Let’s say I want to fit a model that predicts body mass from flipper length. The final part of the model might look like this:

obs_mean = alpha + beta * df["flipper_length_mm"].values
sigma = pm.Exponential("sigma", lam=1)
obs = pm.Normal("likelihood", mu=obs_mean, sigma=sigma, observed=df["body_mass_g"].to_numpy())

Now, I want to account for the sex of the penguin. In an ML model, I can only do this by fitting two individual models or adding sex as another predictor. In a Bayesian model, I could do the same OR I could introduce it as a level in a hierarchical structure.

Method 1: Penalty for sex

sex = df["sex"].map({"MALE": 0, "FEMALE": 1}).values
obs_mean = alpha + beta_sex * sex + beta * df["flipper_length_mm"].values
obs = pm.Normal("likelihood", mu=obs_mean, sigma=sigma, observed=df["body_mass_g"].to_numpy())

Method 2: Random intercept and slope for sex

sex_idx = df["sex"].map({"MALE": 0, "FEMALE": 1}).values
obs_mean = alpha[sex_idx] + beta[sex_idx] * df["flipper_length_mm"].values
obs = pm.Normal("likelihood", mu=obs_mean, sigma=sigma, observed=df["body_mass_g"].to_numpy())

Now, depending on the dataset, it is entirely possible that the predictions from these two models are very similar. However, there is a big difference in their interpretation. In the first, we simply assume that there is some global slope and intercept for penguins and that sex acts as a penalty, such as a varying intercept in a random effects model. However, the slope remains the same, therefore the penalty is exactly the same across all flipper lengths–just a linear shift in the line. In the second, we not only allow for variation in that slope and intercept, but we explicitly control how they are varied based on the priors we choose. They could be constrained to only positive, they could be normal around 0 (so positive and negative slope/intercept), they could be heavy-tailed from a student-t…
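To make the difference in interpretation concrete, here is a small NumPy sketch of the two mean functions. The coefficient values are made up for illustration and are not estimates from the penguins data; the sex coding (0 for male, 1 for female) follows the mapping used in the snippets above.

```python
import numpy as np

flipper = np.array([180.0, 200.0, 220.0])  # flipper lengths in mm

# Method 1: shared intercept and slope, plus a constant shift for sex
# (hypothetical coefficient values)
alpha, beta, beta_sex = -5000.0, 45.0, -500.0
mean_m1 = {sex: alpha + beta_sex * sex + beta * flipper for sex in (0, 1)}
gap_m1 = mean_m1[1] - mean_m1[0]  # equals beta_sex at every flipper length

# Method 2: separate intercept and slope per sex, indexed by sex
# (hypothetical coefficient values)
alpha_by_sex = np.array([-5500.0, -4000.0])
beta_by_sex = np.array([48.0, 39.0])
mean_m2 = {sex: alpha_by_sex[sex] + beta_by_sex[sex] * flipper for sex in (0, 1)}
gap_m2 = mean_m2[1] - mean_m2[0]  # changes with flipper length

print("Method 1 gap between sexes:", gap_m1)
print("Method 2 gap between sexes:", gap_m2)
```

In Method 1 the two lines are parallel, so the gap is the same at every flipper length. In Method 2 the gap depends on flipper length because each sex has its own slope.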

The Bayesian model allows for extremely complicated model structures if the problem requires. By visualizing the model graph we can see the relationship between variables and the assumptions we make by choosing that structure.

In my non-expert experience, this is how I see a main point of difference between the models. Again, predictions might be the same for the two Bayesian models and the ML model; however, we can say so much more about our assumptions in the Bayesian models.

There are certain types of explainability that relate to what is happening under the hood, such as using MCMC for a Bayesian model vs gradient descent for a neural network. I am not really concerned with those as it relates to the original question that I posted.

I hope this helps.