Save and Load a BART model

Hi, I am trying to save and load a trained model for posterior predictions with a model that contains a BART RV. I am running into multiple issues that is making this task seem not possible. If anyone could provide me with some input on what I might be doing wrong I would appreciate it.

from pathlib import Path
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb
import cloudpickle as cpkl
print(f"Running on PyMC v{pm.__version__}")
print(f"Running on PyMC-BART v{pmb.__version__}")


Using the bikes example for simplicity.

    bikes = pd.read_csv(Path("..", "data", "bikes.csv"))
except FileNotFoundError:
    bikes = pd.read_csv(pm.get_data("bikes.csv"))

features = ["hour", "temperature", "humidity", "workingday"]

X = bikes[features]
Y = bikes["count"]

xt = X[0:10]
yt = Y[0:10]

Baseline model

with pm.Model() as model_bikes:
    xdata = pm.MutableData("xdata", X)
    a = pm.Exponential("a", 1)
    mu_ = pmb.BART("mu_", xdata, np.log(Y), m=20)
    mu = pm.Deterministic("mu", pm.math.exp(mu_))
    y = pm.NegativeBinomial("y", mu=mu, alpha=a, observed=Y, shape=xdata.shape[0])
    idata_bikes = pm.sample(random_seed=99, draws=100, tune=100, compute_convergence_checks=False)

Used the clouldpickle to save the trace. I was having some weird issues with to_netcdf (basically the file stayed open and I couldn’t overwrite it when trying things out, it was easier to just pickle).

with open('test4.pkl', mode='wb') as file:
   cpkl.dump(idata_bikes, file)

with open("test4.pkl", mode="rb") as file:
    idata4 = cpkl.load(file)

Test the posterior predictions on the baseline model using the baseline idata and the saved/reload idata.
Both work fine.
It also works to set the new data when using the baseline model.

with model_bikes:
    pm.set_data({"xdata": xt})
    post1 = pm.sample_posterior_predictive(idata_bikes, var_names=["mu", "y"])

with model_bikes:
    pm.set_data({"xdata": xt})
    post2 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])

This is where things start to get difficult. Based on other examples I should be able to create a new model (untrained) and then get posterior predictions with the saved traced.

New model
Input data is the same as the input data in the baseline model.

with pm.Model() as model2:
    xdata2 = pm.MutableData("xdata", X)
    a2 = pm.Exponential("a", 1)
    mu_2 = pmb.BART("mu_", xdata2, np.log(Y), m=50)
    mu2 = pm.Deterministic("mu", pm.math.exp(mu_2))
    y2 = pm.NegativeBinomial("y", mu=mu2, alpha=a2, observed=Y, shape=xdata2.shape[0])

NOTE I named the new variable with suffix 2 because I thought there maybe was a chance that variables from the baseline model were causing issues, but that didn’t make a difference.

Getting the posterior predictive on the full data input into the model using the baseline trace and the saved trace.

with model2:
    post3 = pm.sample_posterior_predictive(idata_bikes, var_names=["mu", "y"], )

with model2:
    # pm.set_data({"xdata": xt})
    post5 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"], )

This seems to work well enough, with outputs being relatively close to the outputs of the baseline model.

However, when I try to set new data with the new model I an unable too and get a shape error. This seems odd, because setting new data on the baseline model doesn’t throw a shape error, so I am not sure where this error is arising from.

with model2:
    pm.set_data({"xdata": xt})
    post4 = pm.sample_posterior_predictive(idata_bikes, var_names=["mu", "y"])

with model2:
    pm.set_data({"xdata": xt})
    post6 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])


	"name": "ValueError",
	"message": "size does not match the broadcast shape of the parameters. (10,), (10,), (348,)\nApply node that caused the error: nbinom_rv{0, (0, 0), int64, True}

So I tried instantiating the model with the new data size

with pm.Model() as model3:
    xdata2 = pm.MutableData("xdata", xt)
    a2 = pm.Exponential("a", 1)
    mu_2 = pmb.BART("mu_", xdata2, np.log(yt), m=50)
    mu2 = pm.Deterministic("mu", pm.math.exp(mu_2))
    y2 = pm.NegativeBinomial("y", mu=mu2, alpha=a2, observed=yt, shape=xdata2.shape[0])

And this runs

with model3:
    # pm.set_data({"xdata": xt})
    post6 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"], )
with model3:
    # pm.set_data({"xdata": xt})
    post4 = pm.sample_posterior_predictive(idata_bikes, var_names=["mu", "y"], )

However the output predictions are not relatively close to what I expect. In fact the output of the BART RV is the same across all observations, so it doesn’t appear that the BART tree structure from the prior training is being used at all.

Finally I tried to pickle the whole model. The model with save and load, but when I try to do anything with the model I receive an error that I believe is associated with the multiprocessing in training the BART model.
NOTE: it says file not found error, but the file will load, it is only one I try to do something with it that it starts having issues.

"name": "FileNotFoundError",
"message": "[Errno 2] No such file or directory\nApply node that caused the error: BART_rv{1, (2, 1, 0, 0, 0, 1), floatX, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7FF34E558900>), [], 11, xdata, [2.7725887 ... .49650756], 20, 0.95, 2.0, [])\nToposort index: 1\nInputs types: [RandomGeneratorType, TensorType(int64, shape=(0,)), TensorType(int64, shape=()), TensorType(float64, shape=(None, None)), TensorType(float64, shape=(348,)), TensorType(int8, shape=()), TensorType(float64, shape=()), TensorType(float32, shape=()), TensorType(float64, shape=(0,))]\nInputs shapes: ['No shapes', (0,), (), (10, 4), (348,), (), (), (), (0,)]\nInputs strides: ['No strides', (8,), (), (8, 80), (8,), (), (), (), (8,)]\nInputs values: 

If anyone has had success in saving and reloading a model that has a BART RV and wouldn’t mind sharing how they did it, that would be amazing.

Also of note I have been able to successfully save and load with other models that don’t include the BART RV, so I am not convinced it is a OS problem.


Hi, just to follow-up I think I found a work around to this. I was able to convert to list and pickle the all_trees object from the RV. This object does not have any issues w/ multiprocessing/connection that I see with the full BARTRV or the Model when pickled and loaded. Then I can use the pmb.utilites._sample_posterior(...) function to get draws from the tree structure, which as it appears is how the pdp utility function works.

all_trees = list(model_bikes.mu_.owner.op.all_trees)
with open('test_all_tree.pkl', mode='wb') as file:
   cpkl.dump(all_trees, file)

Restart session and reload modules

with open('test_all_tree.pkl', mode='rb') as file:
   all_tree2 = cpkl.load(file)

rng = np.random.default_rng()
xt_2 = pt.constant(xt)
smpl = pmbu._sample_posterior(all_tree2, xt_2, rng, size = 400, shape=1)
mu_smpl = smpl.mean(0)
ex_mu = pm.math.exp(mu_smpl).eval()

ex_mu appears to appropriately match the pm.sample_posterior_distribution(...) from the trained model and with the updated covariate matrix.

For my actual project, I only care about the mu value output from the BARTRV, so I think that is appropriate method to predict mu with a new covariate matrix. If I needed the y value I believe it could be extended with a sample from whatever the final distribution is w/ the mu value inputed, as I have seen in other examples.

If anyone can confirm that this is an appropriate way to save/load/predict from the BARTRV or provide me with an alternative method that would be great!


@aloctavodia ?

Sorry for the late reply, I was very busy with other stuff, still are actually so I am just answering to tell you that the workaround seems fine to me. I will try next week to inspect this issue with more time and see if I can provide a better answer or a more general solution. I will do it on GitHub BART model save, reload and new predictions · Issue #123 · pymc-devs/pymc-bart · GitHub