Looking at BART, this seems like a pretty weird way to approximate a Gaussian process, assuming it converges to the same result in a limit. My question is why you’d use BART instead of a GP directly. Is BART generally faster, or is there some other advantage to it? If it’s speed, how does BART compare to other methods for quickly approximating a Bayesian regression?
I didn’t know BART was trying to approximate a GP
Hi @ParadaCarleton thanks for your question.
Many models can be seen as particular instances or approximations to other models. For example neural networks, splines and even liner model csn be seen as GPs under the proper conditions. This property of one model been an approximation to another one can be very useful to theoretically gain a better understanding of the statistical properties of those models. But I wil not say that BART is trying to approximate a GP, as that’s not the goal of BART. Also there have been some proposals combining BART and GPs, where for example the values of the leaf nodes are sampled from a GP.
As you mentioned in your question other features come into play when working in practice. In the literature, BART is generally used as a method that requieres little user input and provides very good fit/predictions. So it seems like a good model/mehod for a low investment.
Other models like GPs generally requiere more user involvement, and sometimes that investment is needed to get better predictions or better understanding. Generally speking adding structure to a model should help to get a better model. BART by itself is not able to provide much structure, we may add ways to add more structure in the future.
Benchmark van be tricky and I don’t have a one that I have run with our implementation (maybe we should) and I don’t remember seing a benchmark from other implementation (but hey are probably out there), so I can not provide any accurate estimates of the speed diferences between BART and other models. Also speed will depends on the implementations, for example there is an XBART package in R, that has been designed for speed.
For our implementation we know we can improve speed and memory usage, and we have some ideas on how to do it. But so far the main focus has been on making a flexible BART implementation that can be used as a building block of user defined models. This contrast with all or most other implementations that focus on conjugate priors so a model/implementation for a poisson likelihood is different from the one wih a gaussian likelihood.
I see, that’s helpful. The reason I’m wondering when I should use each one is because they seem to be doing fundamentally the same task (nonparametric estimation of a function). When will BART outperform a Gaussian process, if ever? To me it feels like BART is trying to do the same thing as a GP, but with a much less natural connection to the underlying data-generating process. But as you mentioned, there might be practical reasons to favor BART, like performance or simplicity.
My understanding (which is far shallower than @aloctavodia ) is that BART typically requires very little in the way of tweaking, which may be seen as one advantage over GPs (depending on the user and the scenario). Relatedly, I don’t think BART makes any assumptions about smoothness (which could be good or bad, depending). And BART (like other tree-based methods) may handle multinomial outcomes more naturally. But that’s just a bit of speculation based on generalizing from other, related techniques.
Thank you for explanation! Just a quick follow-up question: although BART does not impose any structural dependence, it still have the parameter “m”, which reflects the complexity. My questions is that: can we estimate the complexity parameter, e.g., by using information criteria? For example, less structured tree could fit the data equivalently well compared to a more layered tree. So I wonder if the overfitting could be avoided.
@aakhmetz Indeed, we still have m. As this parameter increases, we have more trees to fit the data, but at the same time the values on the leaves nodes are shrunk.
We did some test in order to provide guidance to pick m. For a few functions and for a few values of m (in the range of 10 to 200) we performed 5-fold cross validation and LOO (PSIS-LOO-CV). We choose the first because it is usually recommended in the literature for BART, And we choose LOO (PSIS-LOO-CV), because it is a fast and reliable way to approximate one-leave-out CV, as we need to fit the model just once. We observed that both method qualitatively agree, indicating that LOO can be used in practice to select m. At least for these tests, we observed that the ELPD (for LOO) keeps increasing with m. The same for the rmsd (for 5-fold CV). We sometimes observed “saturation”, thus sometimes a model with m=50 or m=100 provides very similar predictive accuracy than m=200. Details are here [2206.03619] Bayesian additive regression trees for probabilistic programming (and in an updated version that will be available in one or two weeks).
oh, quite nice - many thanks!