Most appropriate problems for Bayesian models vs. machine learning models

This is a question I have been thinking about for a while. Out of curiosity, what problems are best suited to a Bayesian model using PyMC vs. a machine learning model such as XGBoost? Or is it best to apply both and compare the results, or to use a hybrid solution?

Thank You

1 Like

I do a lot of BART models and I have experimented with XGBoost as well and compared both. It is really easy to get XGBoost to overfit. There are a lot of parameters that you can tune to introduce regularization, but you have to either guess or do a grid search with cross-validation. No matter what parameters you use, XGBoost doesn’t give you uncertainties.
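For context, the tuning loop this implies looks roughly like the sketch below (the synthetic data and parameter grid are purely illustrative, not a recommendation):

```python
# A minimal sketch of the kind of cross-validated grid search this implies;
# the synthetic data and parameter grid are illustrative only.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=xgb.XGBRegressor(n_estimators=200),
    param_grid={
        "max_depth": [2, 4, 6],
        "learning_rate": [0.05, 0.1],
        "reg_alpha": [0.0, 1.0, 10.0],   # L1 regularization strength
        "reg_lambda": [1.0, 10.0],       # L2 regularization strength
    },
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # point predictions only; still no uncertainties
```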

BART, on the other hand, can just be run "out of the box" with the default settings and it will likely give you a good model with uncertainties. The regularization is built in already.

I do use a hybrid approach where I use XGBoost with heavy L1 regularization and low max depth, put those results through SHAP to rank the features from most to least important (including any interactions), then put the top n/2 columns into a BART model. I deal with data where I know at least half the columns are irrelevant to my dependent variable, so that’s what works for me.
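Roughly, that pipeline looks like the following sketch (made-up data; the regularization settings, the SHAP ranking, and the n/2 cutoff are only meant to show the shape of the approach):

```python
# A rough sketch of the hybrid approach described above; the synthetic data,
# regularization settings, and the "top half" cutoff are illustrative only.
import numpy as np
import pandas as pd
import xgboost as xgb
import shap
import pymc as pm
import pymc_bart as pmb

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 20)), columns=[f"x{i}" for i in range(20)])
y = X["x0"] + 2 * X["x1"] + rng.normal(scale=0.5, size=500)  # most columns irrelevant

# 1. Shallow, heavily L1-regularized XGBoost model
booster = xgb.XGBRegressor(max_depth=2, reg_alpha=10.0, n_estimators=200)
booster.fit(X, y)

# 2. Rank features by mean |SHAP| value and keep the top half
shap_values = shap.TreeExplainer(booster).shap_values(X)
importance = np.abs(shap_values).mean(axis=0)
top_cols = X.columns[np.argsort(importance)[::-1][: X.shape[1] // 2]]

# 3. Feed only those columns into a BART model with uncertainties
with pm.Model():
    mu = pmb.BART("mu", X[top_cols].to_numpy(), y.to_numpy())
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y.to_numpy())
    idata = pm.sample()
```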

More generally, I just like Bayesian models more because they give you distributions for everything, and I find them to be more aligned with how people think intuitively compared to frequentist statistics. For example, a Bayesian credible interval is what most (non-stats) people think a frequentist confidence interval is.

3 Likes

Thanks Mike, really appreciate your answer! I have been thinking about this question for a while.

2 Likes

This is best illustrated by which questions a Bayesian design is trying to answer vs. a frequentist one.

To paraphrase a quote from a Bayesian author I really like: "the statistics wars are over", yet the use cases where you would choose one persist

…assuming you are not in (many) common scenarios using uniform priors and/or copious amounts of data :wink:

In general:

  • A Bayesian wants to be able to provide direct inference on the parameters (or predictions) of the model by exploiting Bayes' rule.* This means we combine the prior and the likelihood to produce probability distributions for a quantity of interest. This allows us to say "conditional on my model and the data I have seen, there is a 95 percent chance the value I’m interested in lives between [a, b]."

  • For the frequentist, the goal is to bound with some probability the number of times you’d be 'surprised' by the outcomes of an experiment. I.e., they want to limit the number of times a confidence interval does not contain a true parameter value if you were to resample from the population and redo the analysis. (The contrast is sketched in code right after this list.)
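To make the contrast concrete, here is a toy Beta-Binomial simulation of my own (not tied to any particular library beyond NumPy/SciPy): the Bayesian interval is a direct probability statement about the parameter given the one dataset you saw, while the frequentist guarantee is about coverage over hypothetical repeated experiments.

```python
# Toy illustration: a Bayesian credible interval vs. frequentist coverage.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_p = 0.3
n = 50

# Bayesian: one observed dataset, direct probability statement about the parameter
k = rng.binomial(n, true_p)
posterior = stats.beta(1 + k, 1 + n - k)        # uniform Beta(1, 1) prior
lo, hi = posterior.ppf([0.025, 0.975])
print(f"Credible interval: P({lo:.2f} < p < {hi:.2f} | data) = 0.95")

# Frequentist: coverage over many repeated experiments
covered = 0
for _ in range(10_000):
    k = rng.binomial(n, true_p)
    p_hat = k / n
    half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # Wald interval
    covered += (p_hat - half) <= true_p <= (p_hat + half)
print(f"Confidence interval coverage over repeats: {covered / 10_000:.3f}")
```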

Even So…

Where it gets interesting is that, for many applications and for much of the theory where you might want to model something, Bayes is "the bigger mathematical construct", and it turns out that many frequentist approaches are implicitly Bayesian given your choice of prior.

It’s ALSO true, thanks to well-known theorems like Bernstein–von Mises, that the influence of the prior (basically) washes out for large numbers of samples, so the posterior ends up driven by the likelihood.
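A quick toy check of that intuition (again a made-up Beta-Binomial example): with a small sample the prior visibly moves the posterior, but with a large sample two very different priors give essentially the same interval.

```python
# Toy check of the "prior washes out" intuition with a conjugate Beta-Binomial model.
from scipy import stats

for n, k in [(20, 6), (20_000, 6_000)]:
    for a, b in [(1, 1), (30, 10)]:              # flat prior vs. strongly skewed prior
        post = stats.beta(a + k, b + n - k)
        lo, hi = post.ppf([0.025, 0.975])
        print(f"n={n:>6} prior=Beta({a},{b}): 95% interval = ({lo:.3f}, {hi:.3f})")
```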

So at the end of the day, the approaches might end up agreeing with one another (on the confidence interval for a regression coefficient, for instance).

You were trying to make a point?

I probably left you with more questions than answers, but the main takeaway I wanted to leave you with (at a very high level) was that both schools try to answer different questions and are better understood not as mutually exclusive schools of thought, but as related ones.

As a caveat though, both do tend to struggle with certain questions in various paradigms; but that’s the TL;DR. There’s a lot of mathematical statistics/statistical information theory that unites them across a broad range of applications (like, say, GLMs).

I do think the reason Bayesian approaches make sense is that the layperson naturally interprets statistical constructs the way a Bayesian would phrase them. So in general, across industry, Bayes is a better fit IMO. If you work in certain industries, and have the cash, by all means get freqqyyyy.

**For more information I highly recommend Statistical Rethinking by McElreath, and Frank Harrell’s excellent books/talks/lectures. There are too many of the latter to list.

*** Also, as a disclaimer: I should mention that yes, there are legitimate mathematical differences and interpretations in the motivations of the schools that will turn up in some circumstances. My intention was to give a very high-level view. We can talk about the philosophy of probability and argue whether it exists or not in another thread.

1 Like

Personally, I think that what makes Bayesian boosting attractive, compared with the dark magic that comes from cross-validation (which is more deeply related to Bayesian validation than most people are aware of; see Gelman and Vehtari’s wonderful writeups across many publications), is that:

  1. You can actually understand the grow policy from the definition of the joint distribution of the model.
  2. Provided that you don’t have too many high-leverage points, out-of-sample error quantification is loads less expensive than CV. Although… you might need to use it at some point anyway if that’s not the case.
  3. Bayesian variable selection is a lot more stable than frequentist approaches, which are basically a crapshoot. (I use this term verrrry carefully, as I am not attaching ANY causal implications amongst the selected variables.)
1 Like

I think when you have a mechanistic (i.e., scientific) model of a process, then you can get a lot more mileage out of Bayes than throwing something black box at the problem, particularly when there is not a lot of data. One example where this is the case is hierarchical models, where there is a population of individuals and you want to smooth using the population. This has nothing to do with the whole frequentist vs. Bayesian debate. Here’s a tutorial I wrote in Stan applying this to baseball. Now ask yourself how you’d use something like XGBoost to predict someone’s end-of-year stats given their first 50 at bats.

Original Stan version:

PyMC translation:

Both link to the original 1975 Efron and Morris paper that provides an empirical Bayes approach (which if you don’t know your confusing stats terminology, is a frequentist technique of using the data to set regularization—my case study just casts it into standard fully Bayesian inference).
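For anyone who wants the flavor without clicking through, a minimal partial-pooling sketch in PyMC (not the linked case study itself; the data here are simulated) looks something like this:

```python
# A minimal sketch of partial pooling for batting averages:
# hits in the first 50 at-bats for several players, simulated data.
import numpy as np
import pymc as pm

rng = np.random.default_rng(1975)
n_players = 18
at_bats = np.full(n_players, 50)
hits = rng.binomial(at_bats, rng.beta(80, 220, size=n_players))  # fake data

with pm.Model():
    # Population-level ("league-wide") ability on the log-odds scale
    mu = pm.Normal("mu", 0.0, 1.5)
    sigma = pm.HalfNormal("sigma", 1.0)
    # Each player's ability is shrunk toward the population mean
    theta = pm.Normal("theta", mu=mu, sigma=sigma, shape=n_players)
    p = pm.Deterministic("p", pm.math.invlogit(theta))
    pm.Binomial("y", n=at_bats, p=p, observed=hits)
    idata = pm.sample()

# Posterior means of p are the partially pooled estimates of each player's ability
```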

We see this kind of thing all the time in problems like epidemiology, pharmacology, sports stats, etc.

XGBoost is good when you have a problem that’s basically a non-hierarchical, non-linear regression of unknown form and you don’t have enough data to fit a state-of-the-art neural network. Hence its dominance in the kinds of problems you see on Kaggle. BART is very similar in spirit. You can write down the Bayesian inference problem for BART, but as with many problems like LDA or even simple mixture modeling in high dimensions, the posterior is so combinatorially multimodal that you’ll never be able to sample it properly.

2 Likes

Yeah… I’ve tried to tweak BART to do hierarchical modeling, and while I won’t claim that it’s useless (because there are way smarter people out there than me), it’s been quite cumbersome for my use cases.

I understand that for some regimes you could define BART in a manner where the nodes are GLMs/GPs to remedy this, but I worry about getting enough samples into a particular node to make any sense in my work.

For (2), you can use approximate leave-one-out cross-validation, which has built-in self validation to check that the approximation is accurate. It’s coded up in ArviZ.
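For example, assuming the model’s `idata` already contains a log-likelihood group (e.g. from `pm.sample(idata_kwargs={"log_likelihood": True})`), the PSIS-LOO computation and its Pareto-k self-check are a couple of lines:

```python
# A short sketch of PSIS-LOO in ArviZ; assumes `idata` holds a log_likelihood group.
import arviz as az

loo = az.loo(idata, pointwise=True)
print(loo)  # elpd_loo, p_loo, and a summary of the Pareto k diagnostic

# Pareto k is the built-in self-check: large values flag observations where
# the importance-sampling approximation to leave-one-out is unreliable.
n_bad = int((loo.pareto_k > 0.7).sum())
print(f"{n_bad} observation(s) with k > 0.7; consider refitting without them")
```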

For (3), what do you mean by more stable? You mean under changes in the data or changes to the prior parameters? You can see from the path diagrams used in lasso (L1-regularized regression) that frequentist variable selection done that way is very sensitive to the strength of the shrinkage.
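A quick simulation of that sensitivity (my own toy example using scikit-learn’s lasso path, not something from the thread): which variables come out nonzero changes substantially as the shrinkage strength varies.

```python
# Toy lasso path: the set of selected (nonzero) coefficients depends heavily
# on where you sit on the regularization path.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=100)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
for a, c in zip(alphas[::10], coefs.T[::10]):
    print(f"alpha={a:.3f}  selected={np.flatnonzero(np.abs(c) > 1e-8).tolist()}")
```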

There is a form of Bayesian boosting, or more technically bagging (i.e., bootstrap aggregation), which is a way to expand model posteriors under model misspecification, as described in this paper:

You can do this pretty easily with PyMC.
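Here is a rough sketch of the idea (my own toy illustration of pooling posterior draws over bootstrap resamples, not the paper’s exact procedure):

```python
# A rough sketch of Bayesian bagging: refit the posterior on bootstrap
# resamples of the data and pool the draws.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=200)  # mildly misspecified noise

def fit_once(x_b, y_b):
    with pm.Model():
        alpha = pm.Normal("alpha", 0, 5)
        beta = pm.Normal("beta", 0, 5)
        sigma = pm.HalfNormal("sigma", 2)
        pm.Normal("obs", alpha + beta * x_b, sigma, observed=y_b)
        idata = pm.sample(500, tune=500, chains=2, progressbar=False)
    return idata.posterior["beta"].values.ravel()

bagged_draws = []
for _ in range(20):                              # number of bootstrap replicates
    idx = rng.integers(0, len(x), size=len(x))   # resample rows with replacement
    bagged_draws.append(fit_once(x[idx], y[idx]))
bagged_beta = np.concatenate(bagged_draws)       # the bagged posterior for beta
print(bagged_beta.mean(), bagged_beta.std())
```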

Hi again Professor Carpenter! Super stoked to be speaking with you; I’ve been a huge fan of your work! Let me know if I misspoke anywhere or fudged something up!

  • (2) I may not have been clear enough: I totally agree with you! And kudos to the ArviZ team for making it so easy. I mentioned cross-validation here may be necessary only if we are relying on PSIS or something for our LOO estimation and have some high k values, and our hands may be tied so we can’t try some of the other suggested remedies, which might include transformation of our MCMC draws, analysis of the PSIS behavior if we leave out that handful of problematic observations, etc.

  • For (3) I definitely could have been more specific: I meant stable with respect to the amount of shrinkage, under resampling of a fixed experiment size. Here is a link I’ve used in the past to find some 'proof by simulation': https://youtu.be/DF1WsYZ94Es?t=1547. It’s been a while since I’ve looked at the math in detail, especially with respect to the adjusted p-values for penalized regression.

One of the bigger reasons why I decided to start moving into more Bayesian settings was that I wanted to justify imputation and variable selection more rigorously in my work xD.

Thank you for the resource on BayesBag! This is something I’d like to get better at from an understanding standpoint.