PyMC-BART Paper Models Standard Deviation Using BART, but How?

I am just thrilled with the availability of BART random variables in a probabilistic programming paradigm; the work here is tremendous (paper available here: https://arxiv.org/pdf/2206.03619.pdf).

There is one implementation detail in the paper that I am confused about. In Section 4.4 (Heteroscedasticity), the standard deviation is modelled as a BART random variable using this line:

σ_ = pmb.BART("σ_", X, np.log(Y), m=50)

To me, this line suggests that the BART random variable is set up so that the leaves of the trees are all modelling log(Y) (i.e. not any kind of standard deviation). Thus, when this BART rv is passed to the next line,

σ = pm.Deterministic("σ", np.exp(σ_))

aren’t all of the values of σ going to be draws of “Y” as opposed to draws of the standard deviation of Y?
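For context, my reading is that σ then goes straight into the likelihood as its sigma argument, roughly like this (my own sketch of the surrounding model, not the paper’s exact code; the mean term here is just a placeholder):

import numpy as np
import pymc as pm
import pymc_bart as pmb

with pm.Model() as model:
    σ_ = pmb.BART("σ_", X, np.log(Y), m=50)        # X, Y: the data used in Section 4.4
    σ = pm.Deterministic("σ", pm.math.exp(σ_))
    μ = pm.Normal("μ", 0, 10)                       # placeholder mean model, not from the paper
    y = pm.Normal("y", mu=μ, sigma=σ, observed=Y)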

I feel I am missing something fundamental. How will the leaves of the trees end up being estimates of log(σ) instead of log(Y)?

Thanks for any insight into this.

I have a similar question about this topic.

I constructed a toy model to better understand how BART actually works.

I used a simple dataset where the response variable y follows a normal distribution:

y ∼ Normal(2x_1 + x_2, (x_1^2 + x_2^2)^2)

where the second argument is the variance (i.e. the standard deviation is x_1^2 + x_2^2), and x_1 and x_2 are two independent, uniformly generated predictors.
Here is the code used to generate the data:

import numpy as np

X_train = np.random.rand(5000, 2)                      # two independent uniform predictors
mu = np.dot(X_train, np.array([[2], [1]])).flatten()   # mean: 2*x_1 + x_2
sigma = np.sum(np.square(X_train), axis=1)             # standard deviation: x_1^2 + x_2^2
Y_train = np.random.normal(mu, sigma)
n_obs = len(Y_train)

I then built a BART model as follows:

import pymc as pm
import pymc_bart as pmb

with pm.Model() as model:
    X_obs = pm.Data("X_obs", X_train)
    Y_obs = pm.Data("Y_obs", Y_train)
    ret_mean = pmb.BART("ret_mean", X=X_obs, Y=Y_obs, m=100)   # BART component for the mean
    s = pmb.BART("s", X=X_obs, Y=Y_obs, m=100)                 # BART component for log(sigma)
    ret_sigma = pm.Deterministic("ret_sigma", pm.math.exp(s))  # exp keeps sigma positive
    pm.Normal(
        "y",
        mu=ret_mean,
        sigma=ret_sigma,
        observed=Y_obs,
    )

This structure is similar to the heteroscedasticity example shared by @flyaflya.
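Sampling and comparing the recovered sigma against the true one can be done roughly like this (a sketch, using the names from the model and data above):

with model:
    idata = pm.sample()

# posterior mean of the per-observation sigma vs. the true sigma used to simulate the data
sigma_post = idata.posterior["ret_sigma"].mean(dim=("chain", "draw")).values
print(np.corrcoef(sigma_post, sigma)[0, 1])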

During testing, I found that changing the input Y_obs to np.abs(Y_obs) in the variance component of the model,

s = pmb.BART("s", X=X_obs, Y=np.abs(Y_obs), m=100)

does not seem to affect the results.

This leads me to the following questions:

  1. How does the BART model actually learn the variance component?
  2. In the heteroscedasticity example, both the mean and the variance are modeled with a single BART variable that has multiple components, such as:
w = pmb.BART("w", X, Y, m=200, size=2)

However, if I want to use different sets of predictors for the mean and the variance, is using two separate BART variables the only option? (See the sketch after this list.)
  3. What should I do if the likelihood is a StudentT distribution? When I model the nu parameter, what should I use as the Y input?
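For reference on question 2, my understanding of how the size=2 output gets split in that example is roughly the following (a sketch, not the example’s verbatim code; the exact transform on the second component may differ):

with pm.Model():
    # X, Y: the data from the heteroscedasticity example
    w = pmb.BART("w", X, Y, m=200, size=2)   # w[0]: mean component, w[1]: log-scale component
    pm.Normal("y", mu=w[0], sigma=pm.math.exp(w[1]), observed=Y)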

I would greatly appreciate any insights on this!

Hey,

So generally in BART, the standard deviation is usually estimated from either the sample standard deviation or the residual standard deviation: see Introduction to Bayesian Additive Regression Trees - Juan Martín Loyola. You can scale your outcome before fitting BART, and for continuous outcomes it is common to center it (this generally makes it easier to get more accepted draws).
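For example, centering/scaling a continuous outcome before handing it to BART can look like this (a toy sketch with made-up data, just to show the preprocessing I mean):

import numpy as np
import pymc as pm
import pymc_bart as pmb

# toy data, purely illustrative
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = rng.normal(X.sum(axis=1), 1.0)

# center and scale the continuous outcome before fitting BART
Y_scaled = (Y - Y.mean()) / Y.std()

with pm.Model():
    mu_ = pmb.BART("mu_", X, Y_scaled, m=50)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("y", mu=mu_, sigma=sigma, observed=Y_scaled)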

Edit: oh, I see where you are confused; re-writing.
In 4.4, you are drawing values of the standard deviation in a linear model for the MEAN, where BART is loosely forming a ‘covariance function’ on the observed values of y, given your covariate matrix.

To make things a bit more computationally feasible, you are taking the log of these variables and then fitting BART. In fact, even though we’ve started with some linear assumptions on the MEAN, we are approximating a Gaussian process with this step.

Second edit: wait, this question is two years old.

Sorry for my discombobulated response; I sufficiently confused myself. Back to the original question:

What is basically happening is that your model is being composed into a linear model where, loosely speaking, BART is a ‘multiplicative’ factor on the standard deviation. We’re just using normal theory to create many normal observations.
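As a tiny numeric illustration of the ‘multiplicative’ remark (my own example, not from the paper): a sum of tree outputs on the log scale becomes a product of positive factors on the standard-deviation scale, since exp(Σ_j g_j(x)) = Π_j exp(g_j(x)).

import numpy as np

g = np.array([0.2, -0.5, 0.1])   # hypothetical per-tree outputs g_j(x) on the log scale
print(np.exp(g.sum()))           # sd built from the sum of trees
print(np.prod(np.exp(g)))        # same value, as a product of positive per-tree factors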

It’s getting a little late here, but I will try to come back and give you some resources.