Why use BART instead of a Gaussian Process?

Looking at BART, this seems like a pretty weird way to approximate a Gaussian process, assuming it converges to the same result in a limit. My question is why you’d use BART instead of a GP directly. Is BART generally faster, or is there some other advantage to it? If it’s speed, how does BART compare to other methods for quickly approximating a Bayesian regression?


I didn’t know BART was trying to approximate a GP

CC @aloctavodia


Hi @ParadaCarleton thanks for your question.

Many models can be seen as particular instances of, or approximations to, other models. For example, neural networks, splines, and even linear models can be seen as GPs under the proper conditions. This property of one model being an approximation to another can be very useful for theoretically gaining a better understanding of the statistical properties of those models. But I would not say that BART is trying to approximate a GP, as that's not the goal of BART. Also, there have been some proposals combining BART and GPs, where for example the values of the leaf nodes are sampled from a GP.

As you mentioned in your question, other features come into play when working in practice. In the literature, BART is generally used as a method that requires little user input and provides very good fits/predictions. So it seems like a good model/method for a low investment.

Other models like GPs generally require more user involvement, and sometimes that investment is needed to get better predictions or better understanding. Generally speaking, adding structure to a model should help you get a better model. BART by itself is not able to provide much structure; we may add ways to add more structure in the future.

Benchmarks can be tricky. I don't have one that I have run with our implementation (maybe we should), and I don't remember seeing a benchmark from other implementations (but they are probably out there), so I cannot provide any accurate estimates of the speed differences between BART and other models. Also, speed will depend on the implementation; for example, there is an XBART package in R that has been designed for speed.

For our implementation we know we can improve speed and memory usage, and we have some ideas on how to do it. But so far the main focus has been on making a flexible BART implementation that can be used as a building block of user-defined models. This contrasts with all or most other implementations, which focus on conjugate priors, so a model/implementation for a Poisson likelihood is different from the one with a Gaussian likelihood.
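As an illustration, here is a minimal sketch (made-up data, and assuming the current pymc-bart API) of using BART as a building block inside a user-defined model with a Poisson likelihood:

```python
# Minimal sketch (illustrative only, assuming the current pymc-bart API):
# BART as a building block in a user-defined model with a Poisson likelihood.
import numpy as np
import pymc as pm
import pymc_bart as pmb

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = rng.poisson(np.exp(1.0 + 2.0 * X[:, 0]))

with pm.Model():
    w = pmb.BART("w", X=X, Y=np.log1p(y))                # BART models the log-rate
    pm.Poisson("y_obs", mu=pm.math.exp(w), observed=y)   # non-Gaussian likelihood
    idata = pm.sample()
```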


I see, that’s helpful. The reason I’m wondering when I should use each one is that they seem to be doing fundamentally the same task (nonparametric estimation of a function). When will BART outperform a Gaussian process, if ever? To me it feels like BART is trying to do the same thing as a GP, but with a much less natural connection to the underlying data-generating process. But as you mentioned, there might be practical reasons to favor BART, like performance or simplicity.

My understanding (which is far shallower than @aloctavodia’s) is that BART typically requires very little in the way of tweaking, which may be seen as one advantage over GPs (depending on the user and the scenario). Relatedly, I don’t think BART makes any assumptions about smoothness (which could be good or bad, depending). And BART (like other tree-based methods) may handle multinomial outcomes more naturally. But that’s just a bit of speculation based on generalizing from other, related techniques.


Thank you for the explanation! Just a quick follow-up question: although BART does not impose much structural dependence, it still has the parameter m, which reflects the complexity. My question is: can we estimate this complexity parameter, e.g., by using information criteria? For example, a less structured tree ensemble could fit the data just as well as a more layered one, so I wonder if overfitting can be avoided.

@aakhmetz Indeed, we still have m. As this parameter increases, we have more trees to fit the data, but at the same time the values at the leaf nodes are shrunk.
We did some tests in order to provide guidance on picking m. For a few functions and a few values of m (in the range of 10 to 200) we performed 5-fold cross-validation and LOO (PSIS-LOO-CV). We chose the first because it is usually recommended in the literature for BART, and we chose LOO (PSIS-LOO-CV) because it is a fast and reliable way to approximate leave-one-out CV, as we need to fit the model just once. We observed that both methods qualitatively agree, indicating that LOO can be used in practice to select m. At least for these tests, we observed that the ELPD (for LOO) keeps increasing with m, and the same goes for the RMSD (for 5-fold CV). We sometimes observed “saturation”, i.e., a model with m=50 or m=100 provides very similar predictive accuracy to m=200. Details are here: [2206.03619] Bayesian additive regression trees for probabilistic programming (and in an updated version that will be available in one or two weeks).
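As a rough illustration (my own sketch, not the actual code used for the paper, and assuming the current pymc-bart and ArviZ APIs), comparing a few values of m with PSIS-LOO-CV can look like this:

```python
# Rough sketch (not the code from the paper): compare several values of m with
# PSIS-LOO-CV, assuming the current pymc-bart and ArviZ APIs.
import arviz as az
import numpy as np
import pymc as pm
import pymc_bart as pmb

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 100)[:, None]
y = np.sin(4 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 100)

def fit_bart(m):
    with pm.Model():
        mu = pmb.BART("mu", X=X, Y=y, m=m)      # sum of m trees
        sigma = pm.HalfNormal("sigma", 1.0)
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
        # keep the pointwise log-likelihood so az.loo / az.compare can use it
        return pm.sample(idata_kwargs={"log_likelihood": True})

idatas = {f"m={m}": fit_bart(m) for m in (10, 50, 100, 200)}
print(az.compare(idatas))  # ELPD often "saturates" well before m=200
```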


oh, quite nice - many thanks!

I’ve done some work on this, so I think I can give a good answer.

You are right that, in the limit of infinite trees, BART converges to a GP. The point of BART, though, is not approximating a GP. Indeed, sending the number of trees to infinity decreases performance on average. As also reported in the original BART paper (Chipman et al. 2010), performance initially quickly increases with the number of trees, peaks, and then slowly decreases.

The reason that the GP limit of BART is not as good as BART is the same reason the GP limit of Neural Networks is not as good as NNs. This reason can be stated in various ways:

  • Stat jargon: multivariate Normal distributions don’t do variable selection. Applying Bayes to a Normal won’t get you something that tries to use as few variables as possible.
  • ML jargon: GPs don’t learn features. If you define the GP on some X with a given kernel, the features of X it attends to are fixed in stone by the choice of kernel. NNs are good at finding what’s important in X by themselves.
  • AI jargon: GPs are not hierarchical. There is not a simple hypothesis with 1/2 of the probability, then 2 more complex hypotheses with 1/4 of the probability each, and so on. The probability is uniformly spread amongst a sea of hypotheses of similar complexity. So, unlike with the Solomonoff prior, applying Bayes to this prior takes a long time to learn generic things: every detail has to be hammered into the GP datapoint by datapoint; it won’t try to generalize by itself.

If you have x in R^n with x ~ N(0, I), the Normal distribution really likes |x|^2 ≈ n and wants to spread the magnitude of the vector across all components. Conditioning the Normal on a linear combination of x just sends you to another Normal that still spreads probability amongst all components uniformly w.r.t. the same measure as before.

Geometrically, you can imagine it in the following way. Diagonalize your GP, i.e., find a set of basis functions f_i(x) such that your f ~ GP can be written as f = Sum_i u_i f_i(x), where u ~ N(0, I) (Mercer’s theorem). Consider the typical set of this Normal, which is a thin spherical surface (because |u|^2 = Sum_i u_i^2 follows a chi^2 distribution, which concentrates around its mean as n -> oo).
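A quick numerical check of this concentration (my own illustration):

```python
# My own illustration: the squared norm of u ~ N(0, I_n) concentrates around n,
# so for large n the typical set is close to a thin spherical shell.
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000, 10_000):
    u = rng.standard_normal((5000, n))
    r2 = (u**2).sum(axis=1)                # chi^2 with n degrees of freedom
    print(n, r2.mean() / n, r2.std() / n)  # mean/n -> 1, relative spread -> 0
```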

When you add some datapoints and apply Bayes, you are restricting the Normal variable u to a subspace. Imagine a sphere cut by a plane: the intersection is a circle whose axis passes through the center of the sphere. Whatever the orientation of the plane, the result is a sphere of lower dimension and, by symmetry, the probability is still spread uniformly around the new sphere.
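And a small check (again my own illustration) that conditioning on a linear combination leaves the Normal isotropic on the remaining subspace:

```python
# My own illustration: conditioning u ~ N(0, I) on a linear combination a.u = c
# gives posterior covariance I - a a^T / |a|^2, whose eigenvalues are 1 (n-1 times)
# and 0, i.e. the Normal is still isotropic on the subspace orthogonal to a.
import numpy as np

n = 5
a = np.random.default_rng(1).standard_normal(n)
cov_post = np.eye(n) - np.outer(a, a) / (a @ a)
print(np.round(np.sort(np.linalg.eigvalsh(cov_post)), 6))  # [0. 1. 1. 1. 1.]
```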

In practical usage, a GP is never really just a GP, because you have free parameters in the kernel. So your overall prior is a mixture of GPs, one for each possible set of kernel parameters. Since the leaves of BART (i.e., the values in each grid cell the trees split the space into) are a priori Normal, BART, too, is a mixture of GPs, one for each tree-ensemble configuration.

Even though standard GPs and BART are both, mathematically, mixtures of GPs, BART is a much richer mixture than your run-of-the-mill GP with 10 free hyperparameters. This increased flexibility allows it to reach higher performance in many cases.

In particular, if you don’t already know the features to look at, or, in other words, what the right kernel choice is, then BART is probably going to do better than a GP.

Next, there’s the computational aspect. The so-called “naive implementation” of GPs scales as O(n^3). On paper, BART scales as O(n), although this is not completely true in practice. Anyway, BART scales better by default, and in my empirical experience it’s faster. You can scale GPs with approximations, but approximations typically reduce the statistical performance.
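For reference, here is a minimal sketch (my own, not taken from any package mentioned in this thread) of exact GP regression, where the Cholesky factorization of the n x n kernel matrix is the O(n^3) bottleneck:

```python
# Minimal sketch (my own) of exact GP regression with an RBF kernel; the
# Cholesky factorization of the n x n kernel matrix is the O(n^3) step.
import numpy as np

def gp_posterior_mean(X, y, Xstar, lengthscale=1.0, noise=1e-2):
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    K = rbf(X, X) + noise * np.eye(len(X))   # n x n covariance matrix
    L = np.linalg.cholesky(K)                # O(n^3): the bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return rbf(Xstar, X) @ alpha             # posterior mean at the test points

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, 200)
print(gp_posterior_mean(X, y, np.array([[0.0], [0.5]])))
```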

I have some slides about this on my website, and I have a software package (lsqfitgp) that implements the infinite-tree limit of BART as a GP. With that software, I showed that you can actually reach the black-box performance of BART with its GP limit, but only if you tune more hyperparameters in the GP version.


Sorry to revive this, but @aloctavodia @Gattocrucco can BART deal with PyMC random variables as input? For example, latent variables when you’re regressing y vs. x but both x and y have measurement uncertainties (e.g., the example in Chapter 14 of McElreath’s “Statistical Rethinking” textbook)? If not, perhaps this is a case where Gaussian processes are more useful (e.g., for non-parametric regression)?

This post suggests BART cannot natively deal with random/uncertain inputs: Latent Variable BART | Vittorio Orlandi

That’s a good point. PyMC-BART, and the other BART implementations I know of, don’t allow the inputs to be RVs.
