Most appropriate problems for Bayesian models vs. machine learning models

This is a question I have been thinking about for a while. Out of curiosity: what problems are best suited to a Bayesian model using PyMC vs. a machine learning model such as XGBoost? Or is it best to apply both and compare the results, or to use a hybrid solution?

Thank You

2 Likes

I do a lot of BART models and I have experimented with XGBoost as well and compared both. It is really easy to get XGBoost to overfit. There are a lot of parameters you can tune to introduce regularization, but you have to either guess or do a grid search with cross-validation. No matter what parameters you use, XGBoost doesn't give you uncertainties.

BART, on the other hand, can just be run "out of the box" with the default settings and will likely give you a good model with uncertainties. The regularization is built in already.
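To make that concrete, here is a minimal sketch of the kind of out-of-the-box BART model I mean, assuming the pymc-bart package is installed and that X and y are an already-loaded feature matrix and continuous target (hypothetical names, not a tuned model):

```python
# Minimal sketch of an "out of the box" BART regression with pymc-bart.
# Assumes X is a 2-D numpy array of features and y a 1-D numpy array target.
import pymc as pm
import pymc_bart as pmb

with pm.Model() as model:
    # BART prior over the mean function; the tree prior provides the
    # built-in regularization mentioned above.
    mu = pmb.BART("mu", X, y, m=50)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample()  # posterior draws give uncertainties for mu and predictions
```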

I do use a hybrid approach where I use XGBoost with heavy L1 regularization and low max depth, put those results through SHAP to rank the features from most to least important (including any interactions), and then put the top n/2 columns into a BART model. I deal with data where I know at least half the columns are irrelevant to my dependent variable, so that's what works for me.
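A rough sketch of that feature-ranking step, assuming xgboost and shap are installed, X is a pandas DataFrame, and y is the dependent variable (the parameter values below are placeholders, not the ones I actually use):

```python
# Hedged sketch of the XGBoost + SHAP ranking step, not a tuned pipeline.
import numpy as np
import xgboost as xgb
import shap

# Heavily L1-regularized, shallow XGBoost model
model = xgb.XGBRegressor(reg_alpha=10.0, max_depth=3, n_estimators=200)
model.fit(X, y)

# Rank features by mean absolute SHAP value
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
importance = np.abs(shap_values).mean(axis=0)

# Keep the top n/2 columns for the downstream BART model
top_half = X.columns[np.argsort(importance)[::-1][: X.shape[1] // 2]]
```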

More generally, I just like Bayesian models more because they give you distributions for everything, and I find them more aligned with how people think intuitively compared to frequentist statistics. For example, a Bayesian credible interval is what most (non-stats) people think a frequentist confidence interval is.

3 Likes

Thanks Mike, really appreciate your answer! I have been thinking about this question for a while.

2 Likes

This is best illustrated by the questions a Bayesian design is trying to answer vs. a frequentist one.

To paraphrase a quote from a Bayesian author I really like: "the statistics wars are over," yet the use cases where you'd choose one over the other persist

…assuming you are not in one of the (many) common scenarios with uniform priors and/or copious amounts of data :wink:

In general:

  • A Bayesian wants to be able to provide direct inference on the parameters (or predictions) of the model by exploiting Bayes' rule.* This means we combine the prior and the likelihood to produce probability distributions for a quantity of interest. This allows us to say, 'conditional on my model and the data I have seen, there is a 95 percent chance the value I'm interested in lives between [a, b]' (written out just after this list).

  • For the frequentist, the goal is to bound with some probability the number of times you'd be 'surprised' by the outcomes of an experiment, i.e., they want to limit the number of times a confidence interval fails to contain the true parameter value if you were to resample from the population and redo the analysis.
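For concreteness, here is the rule and the interval statement referenced in the first bullet, in my own notation (with theta the quantity of interest and y the data):

$$
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},
\qquad
P\big(a \le \theta \le b \mid y\big) = \int_a^b p(\theta \mid y)\, d\theta = 0.95 .
$$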

Even So…

Where it gets interesting is that, for many applications and for much of the theory where you might want to model something, Bayes is 'the bigger mathematical construct', and it turns out that many frequentist approaches are implicitly Bayesian given your choice of prior.

It's ALSO true, thanks to well-known theorems like Bernstein-von Mises, that the posterior becomes (basically) independent of the prior for large numbers of samples.

So at the end of the day, approaches might end up agreeing with one another (the confidence interval for a regression coefficient, for instance).

You were trying to make a point?

I probably left you with more questions than answers, but the main takeaway I wanted to leave you with (at a very high level) is that both schools try to answer different questions and are better understood not as mutually exclusive schools of thought, but as related ones.

As a caveat, though, both do tend to struggle to answer certain questions in various paradigms; but that's the tl;dr. There's a lot of math stat/statistical information theory that unites them across a broad range of applications (like, say, GLMs).

I do think the reason Bayesian approaches make sense is that the layperson naturally interprets statistical constructs the way a Bayesian would phrase them. So in general, across industry, Bayes is a better fit imo. If you work in certain industries and have the cash, by all means get freqqyyyy.

**For more information I highly recommend Statistical Rethinking by McElreath, and Frank Harrell's excellent books/talks/lectures; there are too many of the latter to list.

*** Also, to disclaim: I should mention that yes, there are legitimate mathematical differences and interpretations in the motivations of the two schools that will turn up in some circumstances. My intention was to give a very high-level overview. We can talk about the philosophy of probability and argue about whether it exists or not in another thread.

1 Like

Personally, I think what makes Bayesian boosting attractive, instead of the dark magic that comes from cross-validation (…which is more deeply related to Bayesian validation than most people are aware of; see Gelman and Vehtari's wonderful write-ups across many publications), is that:

  1. You can actually understand the grow policy from the definition of the model's joint distribution.
  2. Provided that you don't have too many high-leverage points, out-of-sample error quantification is loads less expensive than CV. Although…you might need to use it at some point anyway if that's not the case.
  3. Bayesian variable selection is a lot more stable than the frequentist approaches, which are basically a crapshoot. (I use this term verrrry carefully, as I am not attaching ANY causal implications to the selected variables.)
1 Like

I think when you have a mechanistic (i.e., scientific) model of a process, then you can get a lot more mileage out of Bayes than throwing something black-box at the problem, particularly when there is not a lot of data. One example where this is the case is hierarchical models, where there is a population of individuals and you want to smooth using the population. This has nothing to do with the whole frequentist vs. Bayesian debate. Here's a tutorial I wrote in Stan applying this to baseball. Now ask yourself how you'd use something like XGBoost to predict someone's end-of-year stats given their first 50 at bats.

Original Stan version:

PyMC translation:

Both link to the original 1975 Efron and Morris paper that provides an empirical Bayes approach (which, if you don't know your confusing stats terminology, is a frequentist technique of using the data to set regularization; my case study just casts it into standard fully Bayesian inference).
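For readers who want to see the shape of such a model without opening the links, here is a minimal partial-pooling sketch in PyMC, assuming hits and at_bats are integer arrays with one entry per player (hypothetical names, not the case study's actual code):

```python
# Hedged sketch of a hierarchical (partial-pooling) binomial model.
# Assumes hits and at_bats are 1-D integer numpy arrays, one entry per player.
import pymc as pm

with pm.Model() as model:
    # Population-level distribution of player abilities on the logit scale
    mu = pm.Normal("mu", 0.0, 1.5)
    sigma = pm.HalfNormal("sigma", 1.0)
    ability = pm.Normal("ability", mu=mu, sigma=sigma, shape=len(hits))
    theta = pm.Deterministic("theta", pm.math.invlogit(ability))
    pm.Binomial("hits_obs", n=at_bats, p=theta, observed=hits)
    idata = pm.sample()
# Each player's theta is shrunk toward the population mean, which is the
# smoothing-by-the-population behavior described above.
```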

We see this kind of thing all the time in problems like epidemiology, pharmacology, sports stats, etc.

XGBoost is good when you have a problem that's basically a non-hierarchical, non-linear regression of unknown form and you don't have enough data to fit a state-of-the-art neural network. Hence its dominance in the kinds of problems you see on Kaggle. BART is very similar in spirit. You can write down the Bayesian inference problem for BART, but as for many problems like LDA or even simple mixture modeling in high dimensions, the posterior is so combinatorially multimodal that you'll never be able to sample it properly.

2 Likes

Yeah…I've tried to tweak BART to do hierarchical modeling, and while I won't claim that it's useless (there are way smarter people out there than me), it's been quite cumbersome for my use cases.

I understand that for some regimes you could define BART so that the leaf nodes are GLMs/GPs to remedy this, but I worry about getting enough samples into a particular node for it to make any sense in my work.

For (2), you can use approximate leave-one-out cross-validation, which has built-in self validation to check that the approximation is accurate. It’s coded up in ArviZ.
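A minimal sketch of that check, assuming idata is an InferenceData object that already contains a log_likelihood group (e.g. from pm.sample(..., idata_kwargs={"log_likelihood": True})):

```python
# Approximate leave-one-out CV with its built-in reliability diagnostic.
import arviz as az

loo_result = az.loo(idata, pointwise=True)
print(loo_result)           # elpd_loo, its standard error, and a Pareto-k summary
print(loo_result.pareto_k)  # per-observation k values; large k (e.g. > 0.7)
                            # flags points where the approximation is unreliable
```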

For (3), what do you mean by more stable? Do you mean under changes in the data, or changes to the prior parameters? You can see from the path diagrams used in lasso (L1-regularized regression) that frequentist variable selection done that way is very sensitive to the strength of the shrinkage.
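A small illustration of that sensitivity, using scikit-learn's lasso_path on a hypothetical standardized design matrix X and response y:

```python
# How the selected subset changes along the L1 regularization path.
import numpy as np
from sklearn.linear_model import lasso_path

alphas, coefs, _ = lasso_path(X, y)           # coefs has shape (n_features, n_alphas)
n_selected = (np.abs(coefs) > 1e-8).sum(axis=0)
for a, k in zip(alphas, n_selected):
    print(f"alpha={a:.4f}: {k} variables selected")
```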

There is a form of Bayesian boosting, or more technically bagging (i.e., bootstrap aggregation), which is a way to expand model posteriors under model misspecification, as described in this paper:

You can do this pretty easily with PyMC.
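A very rough sketch of the bagging-of-posteriors idea in PyMC, not the paper's exact algorithm; build_model is a hypothetical helper that returns a model for a resampled dataset, and "beta" is a stand-in parameter name:

```python
# Hedged sketch: pool posterior draws across bootstrap resamples of the data.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
pooled_draws = []
for _ in range(20):                            # number of bootstrap replicates
    idx = rng.integers(0, len(y), size=len(y)) # resample rows with replacement
    with build_model(X[idx], y[idx]):          # hypothetical model-building helper
        idata = pm.sample(progressbar=False)
    pooled_draws.append(idata.posterior["beta"].values.ravel())

# The "bagged" posterior for beta is the pooled set of draws across replicates.
bagged_beta = np.concatenate(pooled_draws)
```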

Hi again, Professor Carpenter! Super stoked to be speaking with you; I've been a huge fan of your work! Let me know if I misspoke anywhere or fudged something up!

  • (2) I may not have been clear enough: I totally agree with you! And kudos to the ArviZ team for making it so easy. I mentioned that cross-validation may still be necessary only if we are relying on PSIS or something similar for our LOO estimation, have some high k values, and have our hands tied so we can't try the other suggested remedies, which might include transformation of our MCMC draws, analysis of the PSIS behavior when we leave out that handful of problematic observations, etc.

  • For (3) I definitely could have been more specific: I meant stable with respect to the amount of shrinkage, under resampling of a fixed experiment size. Here is a link I've used in the past to find some 'proof by simulation': https://youtu.be/DF1WsYZ94Es?t=1547. It's been a while since I've looked at the math in detail, especially with respect to the adjusted p-values for penalized regression.

One of the bigger reasons I decided to start moving into more Bayesian settings was that I wanted to justify imputation and variable selection more rigorously in my work xD.

Thank you for the resource on BayesBag! This is something I'd like to get better at from an understanding standpoint.

From what I understand, adjusting p-values for penalized regression like the lasso, or what the frequentists call "post-selection inference," is still a largely open problem in frequentist stats.

Most of the "Bayesian" approaches to multiple imputation are cut (like in BUGS), meaning that you do the imputation, then take the results of the imputation, and for each one, fit a Bayesian posterior, then throw all those posteriors together. This is only an approximation of what we would get if we just extended our generative model to the data and imputed as part of a joint model. That's what happens when you just assign parameters to missing data and give them a model. It is far more in keeping with Bayesian philosophy to just build a fully generative model of the data and fit it jointly with everything else.
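As a concrete illustration of the "just assign parameters to missing data" route, here is a minimal sketch using PyMC's masked-array imputation, assuming x is a 1-D covariate with NaNs and y a fully observed response (hypothetical names):

```python
# Hedged sketch: missing covariate entries become parameters in the joint model.
import numpy as np
import pymc as pm

x_masked = np.ma.masked_invalid(x)  # mask the NaN entries

with pm.Model() as model:
    # Model for the covariate; PyMC treats the masked entries as unobserved
    # quantities and imputes them jointly with everything else.
    mu_x = pm.Normal("mu_x", 0.0, 10.0)
    sigma_x = pm.HalfNormal("sigma_x", 5.0)
    x_obs = pm.Normal("x_obs", mu=mu_x, sigma=sigma_x, observed=x_masked)

    # Regression of interest, using the (partly imputed) covariate
    alpha = pm.Normal("alpha", 0.0, 10.0)
    beta = pm.Normal("beta", 0.0, 10.0)
    sigma = pm.HalfNormal("sigma", 5.0)
    pm.Normal("y_obs", mu=alpha + beta * x_obs, sigma=sigma, observed=y)
    idata = pm.sample()
```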

The practical problem is that we often want to do imputation by making it something like a conditional model. Even if we stick to unconstrained covariates that we want to impute and give them a multivariate normal model (or generalize slightly to a multivariate normal with a link or different distribution), the imputation is a pain computationally as you try to chop down the marginals of the multivariate normal to get inference for what you actually want. The same thing happens if you try to use a copula.

Generally, the problem of variable selection is intractable. It can be shown that variable selection is NP-hard in the general case. See, e.g., Natarajan et al., Sparse approximate solutions to linear systems or Foster et al., Variable selection is hard. Most things that involve choosing subsets of bigger sets wind up being NP-hard because there’s an exponentially sized decision space.

P.S. I haven’t been a professor since 1996 :slight_smile:. It’s a really hard job and I prefer research scientist roles, even though I love teaching.

Do you prefer Dr. Bob instead? I like to make sure I’m being respectful :slight_smile:

My first intro to Bayesian imputation was Stef van Buuren's Flexible Imputation of Missing Data, and I realized how much I had to revisit from my Bayes elective in grad school. Generally I try to motivate a DAG for the missingness mechanism, but I'd be lying if I said I didn't sometimes just hit it with MICE and average everything in a pinch. If you have any recommendations I'd love to read them! I'm always on the hunt for more resources (currently trying to revisit some measure theory and math stat that industry has trained out of me).

Yeah, I've kind of given up on variable selection as a general rule, unless I'm being asked to model causal relationships. It doesn't help that, many times, people think 'causal' when they hear 'variable selection'. The only time I'm really looking at, say, partial dependence so that I can eliminate variables is when I'm trying to reduce the cost of my model.

I really haven't seen anything recently that justifies switching my workflows back to using penalized likelihood in this space either.

But I must admit that grad school was probably five years ago and I've been spending my time actually getting better at my Bayes workflows :sweat_smile:. I'm not even sure what this would look like, because I imagine that to motivate any sort of consistent or stable estimator, you'd need a way to generate a superset of the 'truly non-noise variables'. That doesn't sound feasible in general (problems with causal identification notwithstanding).

We're in research: everyone's on a first-name basis, so just "Bob". Or "Robert" if you're mad at me or need to see my passport :slight_smile:

Not really, other than to build a good model. Multiple imputation tends to be much more robust than single imputation because of the way averages work (i.e., if f is non-linear, then we do not have mean(f(x)) = f(mean(x))). So even uncertainties like variance (quadratic) don't work out right. Also, even in high-dimensional normal distributions, averages of draws are far more concentrated than the draws themselves.
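A tiny numerical illustration of that mean(f(x)) != f(mean(x)) point, with f(x) = x^2 and draws from a standard normal (my example, not from the thread):

```python
# mean(f(x)) vs. f(mean(x)) for a non-linear f
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100_000)

print(np.square(x).mean())   # ~1.0, since E[x^2] = Var(x) + E[x]^2 = 1
print(np.square(x.mean()))   # ~0.0, the square of the sample mean
```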