How to use sample_posterior_predictive for out-of-sample prediction with BART?

Hi! Could you please help me with this problem?

I’m using pmb.predict() for making predictions now, but it’s much slower compared to pm.sample_posterior_predictive(). Is there a reason why?

Also, for now I only know how to use pm.sample_posterior_predictive() for in-sample prediction, but I don’t know how to use it for out-of-sample prediction. Is there a way to do that? Or is there a way to make predictions faster?

The following code is my attempt to use pm.sample_posterior_predictive() for out-of-sample data, and I got an unexpected result.

with pm.Model() as model_2:
    pred = pm.MutableData("pred", X_train)
    σ = pm.HalfNormal("σ", y_train.std())
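    # passing pred.get_value() because pmb.BART doesn't seem to accept the shared variable directly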
    μ = pmb.BART("μ", pred.get_value(), y_train, m=50)
    y = pm.Normal("y", μ, σ, observed=y_train)
    idata_2 = pm.sample(random_seed=RANDOM_SEED)
with model_2:
    pm.set_data({"pred": X_test})
    idata_2 = pm.sample_posterior_predictive(
        idata_2,
        var_names=["μ", "σ"],
        return_inferencedata=True,
        predictions=True,
        extend_inferencedata=True,
        random_seed=rng,
    )

My test set has 1250 data points and my training set has 3750 data points, but the predictions come back with the training-set dimension rather than the test-set dimension, which is unexpected.

Hi, yiyi-z.
I’m not a team member, but I posted a similar question, so I’m sharing the link.

I hope this post helps you.

I was told that BART can’t work with sample_posterior_predictive, and aloctavodia, a member of the team, suggested that I use “stats.norm”.
(Now I’m still figuring out how to use stats.norm. If you know how to use it and don’t mind sharing, please share it with me and others.)
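My current guess (untested, so please correct me) is that once you have posterior draws of μ and σ, stats.norm can generate the predictive samples directly. A minimal sketch, where mu_draws and sigma_draws are placeholder names for draws flattened over (chain, draw):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# placeholder posterior draws; in practice these come from the fitted model
n_samples, n_obs = 4000, 1250
mu_draws = rng.normal(size=(n_samples, n_obs))    # μ draws, one row per posterior sample
sigma_draws = np.abs(rng.normal(size=n_samples))  # σ draws

# one Normal draw per posterior sample and data point
y_pred = stats.norm(loc=mu_draws, scale=sigma_draws[:, None]).rvs(random_state=rng)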

Thanks.


Hi y_m, thank you for sharing the post! (Sorry that I missed your reply for so many days.)
I’m also very new to PyMC, so please point out any mistakes if you notice them. Thank you!

For in-sample data, I use sample_posterior_predictive() to make predictions.
What I do is like this:

with model_0:
    pm.sample_posterior_predictive(idata_0, extend_inferencedata=True, random_seed=rng)
# flatten the (chain, draw) dimensions into a single axis of total_draws samples
train_pred_m0 = np.array(idata_0.posterior_predictive.to_array()).reshape(total_draws, num_data)

Please correct me if I’ve understood this wrong. In my understanding, after we call sample_posterior_predictive(), we can create this array that records the prediction for each data point in each draw.

If we want to use the mean of these predictions as our final prediction for each data point, we can do this:

train_pred_mean_m0 = np.average(train_pred_m0, axis=0)
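(If I’m reading the xarray API right, the same mean can also be taken directly on the InferenceData without the reshape, assuming the observed variable is called “y” as in the model:)

train_pred_mean_m0 = idata_0.posterior_predictive["y"].mean(dim=("chain", "draw")).values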

However, I still can’t figure out how to do out-of-sample prediction using sample_posterior_predictive(), so I’m still using pmb.predict() for out-of-sample predictions for now.
I also wonder why sample_posterior_predictive() seems to be faster than pmb.predict(). Could you please share some thoughts on this?

I see that in your post you asked whether it is correct that we do not use the whole posterior distribution. I’m not sure if you mean that the current pmb.predict() only gives us μ_pred but not σ_pred, or that we do not use all the draws. If it’s the latter, I think it’s fine as long as we use a reasonable size for pmb.predict().

I compared the results of using all the draws vs. using only some of the draws, and the results were quite similar. The code I use for making predictions with all the draws is below (I only modified some lines of the pmb.predict() source code to get this):

import numpy as np

# for every sample from the posterior, get the BART prediction
def bart_predict(idata, X):
    bart_trees = idata.sample_stats.bart_trees
    # flatten (chain, draw) so each entry along "trees" is one posterior sample
    stacked_trees = bart_trees.stack(trees=["chain", "draw"])

    num_bart = stacked_trees.trees.size                               # number of posterior samples
    shape = stacked_trees.isel(trees=0).values[0].predict(X[0]).size  # size of one tree's output

    pred = np.zeros((num_bart, X.shape[0], shape))

    # sum the predictions of all trees belonging to each posterior sample
    for idx, p in enumerate(pred):
        for tree in stacked_trees.isel(trees=idx).values:
            p += np.array([tree.predict(x) for x in X])

    return pred
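I call it like this, for example (X_test is a NumPy array; the last axis is the size of a single tree’s output, which is 1 here):

mu_test = bart_predict(idata_2, X_test)          # shape (n_samples, n_test, 1)
mu_test_mean = mu_test.mean(axis=0).squeeze(-1)  # mean prediction per test point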

Hi, yiyi-z.

However, I still can’t figure out how to do out-of-sample prediction using sample_posterior_predictive(), so I’m still using pmb.predict() for out-of-sample predictions for now.
I also wonder why sample_posterior_predictive() seems to be faster than pmb.predict(). Could you please share some thoughts on this?

I guess that sample_posterior_predictive() may produce its result without calling pmb.predict().
To be honest, I don’t know exactly what happens inside.

I see that in your post you asked whether it is correct that we do not use the whole posterior distribution. I’m not sure if you mean that the current pmb.predict() only gives us μ_pred but not σ_pred, or that we do not use all the draws. If it’s the latter, I think it’s fine as long as we use a reasonable size for pmb.predict().

Note that in this case we assumed the likelihood is a normal distribution, so what we actually estimated is the posterior distribution of the Gaussian parameters μ and σ.

Aloctavodia confirmed that pmb.predict() gives only samples from BART’s posterior, i.e. the posterior distribution of μ.
To predict new data with full predictive uncertainty, we actually have to use samples from the whole posterior distribution.
So pmb.predict()’s result alone is not enough to predict with probability.
To begin with, we assumed the likelihood is a normal distribution.
Why not use σ as well? :laughing:

But we can’t use sample_posterior_predictive() this way with the current version of BART, so we have to get the posterior samples ourselves with code like yours above.
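My understanding of the whole workflow is something like this sketch (untested, so please correct me; it assumes your bart_predict above, a single output per tree, and a noise parameter named “σ” as in your model):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# μ at the test points, one row per posterior sample (via bart_predict above)
mu_test = bart_predict(idata_2, X_test).squeeze(-1)  # (n_samples, n_test)

# matching σ draws, flattened over (chain, draw) in the same order
sigma_draws = idata_2.posterior["σ"].stack(sample=("chain", "draw")).values  # (n_samples,)

# posterior-predictive draws: y ~ Normal(μ, σ) for every sample and test point
y_test_pred = stats.norm(loc=mu_test, scale=sigma_draws[:, None]).rvs(random_state=rng)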

Sorry for my poor English.
Thanks.


Thank you for kindly sharing all of this!