Non-reproducible predictions using PyMC-BART

Hi,

I’m using PyMC-BART to fit a BART model to some data. I’ve run a cross-validated parameter grid search and now want to rebuild the best models to evaluate them against the test set.

However, when refitting on the same data I get slightly different predictions and performance metrics each time, even though a random seed is set for each chain and for the posterior-predictive sampling.

Reproducible via:

import numpy as np
import pymc as pm
import pymc_bart as pmb
from sklearn import metrics

tr_RMSE = []
va_RMSE = []

for _ in range(5):

    with pm.Model() as modelg:
        # Training design matrix in a mutable container so it can be swapped later
        des = pm.MutableData("des", X_tr.to_numpy())
        σ = pm.HalfNormal("σ", 1)
        μ = pmb.BART("μ", des, g_tr, m=10)
        y = pm.Normal("y", μ, σ, observed=g_tr, shape=μ.shape)
        # Fixed seed per chain, plus a fixed seed for the predictive draws
        trace_g = pm.sample(
            1000, tune=1000, return_inferencedata=False, chains=3, cores=3,
            random_seed=[42, 56, 69], progressbar=False,
        )
        gtrain_posterior = pm.sample_posterior_predictive(
            trace=trace_g, random_seed=42
        )

    with modelg:
        # Swap in the validation design matrix and predict again
        des.set_value(X_va.to_numpy())
        gval_posterior = pm.sample_posterior_predictive(
            trace=trace_g, random_seed=42
        )

    # RMSE of the posterior-predictive mean (averaged over chains and draws)
    tr_RMSE.append(np.sqrt(metrics.mean_squared_error(g_tr, gtrain_posterior.posterior_predictive.y.mean(axis=0).mean(axis=0).values)))
    va_RMSE.append(np.sqrt(metrics.mean_squared_error(g_va, gval_posterior.posterior_predictive.y.mean(axis=0).mean(axis=0).values)))

This gives the following RMSE values across the five runs:

   tr_RMSE  va_RMSE
0    0.423    0.385
1    0.429    0.401
2    0.471    0.404
3    0.461    0.432
4    0.435    0.379
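
In case it helps narrow down where the randomness enters, here’s a smaller check I’d expect to be deterministic (a sketch reusing modelg and trace_g from a single run above; ppc_a / ppc_b are just names for this sketch). If repeated predictive draws from the same trace and seed match exactly, then the run-to-run variation presumably comes from pm.sample itself rather than the predictive step:

# Sketch: with one fixed trace, two posterior-predictive runs with the same
# seed should produce identical draws.
with modelg:
    ppc_a = pm.sample_posterior_predictive(trace=trace_g, random_seed=42)
    ppc_b = pm.sample_posterior_predictive(trace=trace_g, random_seed=42)

# True here would point at pm.sample (the BART/NUTS step) as the source of
# the differences rather than sample_posterior_predictive.
print(np.allclose(
    ppc_a.posterior_predictive.y.values,
    ppc_b.posterior_predictive.y.values,
))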

Hopefully I’m missing something obvious! Thanks in advance!