Missingness in BART?

Hi all,

I am trying to move from the R package bartMachine to PyMC-BART but am running into a major hurdle. My data comes from very heterogeneous sources and has a lot of missing values. In bartMachine, missingness is incorporated into the BART decision rules and the model converges very well. As far as I can tell, nothing like this is implemented in PyMC-BART: any missing values raise an error like "Masked arrays or arrays with nan entries are not supported".

Now, someone might ask why not just stick with bartMachine. The reason is that the data is also heavily overdispersed, so a Negative-Binomial likelihood makes much more sense than the normality assumption that comes with bartMachine. So, on paper, PyMC is the library that makes the most sense for this problem.

Given the situation, I would have the following questions:

  1. Does anyone have any advice or tricks on how to handle NA values with PyMC-BART? I have tried various off-the-shelf imputation methods from sklearn and they do not lead to satisfactory results.

  2. I’d be happy to work on incorporating missingness into the PyMC-BART library. Would this be a welcome feature? Would someone who has some experience with the code base be available to point me to the relevant parts of the library and provide me with some light supervision?
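For concreteness on point 1, the off-the-shelf imputation I tried looks roughly like this (a sketch with toy data; `add_indicator=True` appends binary missingness columns, similar in spirit to the `_isna` dummies in the model below):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])

# median imputation plus binary missingness-indicator columns
imp = SimpleImputer(strategy="median", add_indicator=True)
X_imp = imp.fit_transform(X)
# result: 2 imputed feature columns followed by 2 indicator columns
```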

For reference, this is the baseline model I’m trying to fit:

split_rules = [
    pmb.OneHotSplitRule
    if (c in ['geouid', 'toronto'] or c.endswith('_isna'))
    else pmb.ContinuousSplitRule
    for c in X_train.columns
]

sampling_dict = {
    "tune": 1000,
    "draws": 4000,
    "discard_tuned_samples": True,
    "compute_convergence_checks": True,
    "chains": 4,
    "cores": 4,
    "target_accept": 0.99,
    "random_seed": RANDOM_SEED,
    "nuts_sampler_kwargs": {
        "max_depth": 15,
    },
}

with pm.Model() as model_norm:
    X_data = pm.Data("X_data", X_train)
    Y_data = pm.Data("Y_data", Y_train)
    μ = pmb.BART("μ", X_data, Y_data, m=300, alpha=0.95, beta=2,
                 split_rules=split_rules)
    s = pm.HalfNormal("sigma", sigma=0.05)
    y = pm.Normal("y", mu=μ, sigma=s, observed=Y_data)
    trace_norm = pm.sample(**sampling_dict)

As you can see from the split_rules, I've introduced dummy NA columns to track missingness. This improved convergence a bit but didn't resolve the problem. I've tried a lot of other things, from different priors and hyperparameters to various imputation methods and hierarchical models, but I can't get this to really converge with PyMC, whereas it fits very smoothly with bartMachine.
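In case it's useful to anyone, the dummy NA columns are built along these lines (a sketch with toy data; the indicator naming and the median fill are just what I describe above, since BART input cannot contain NaNs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, 2.0, 2.0]})

# add a binary *_isna indicator for each column with missing values
na_cols = df.columns[df.isna().any()]
for c in na_cols:
    df[f"{c}_isna"] = df[c].isna().astype(int)

# fill the remaining NaNs so the BART input has no missing entries
df = df.fillna(df.median())
```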

CC @aloctavodia

Hi @maybemkl

Currently, PyMC-BART filters out missing values (see pymc-bart/pymc_bart/pgbart.py at 941d258e679e2c57b0571dc953e6bea657706fdb · pymc-devs/pymc-bart · GitHub). Better handling of missingness would be a nice feature; I'd be happy to guide you.
Have you tried running your model without OneHotSplitRule, using just the default splitting rule? In some tests I did in the past, I got better results with the defaults.