Missingness in BART?

Hi all,

I am trying to move from the R package bartMachine to PyMC-BART but am running into a major hurdle. My data comes from very heterogeneous sources and has a lot of missing values. In bartMachine, missingness is incorporated into the BART decision rules and the model converges very well. As far as I can tell, nothing like this is implemented in PyMC-BART: any missing values raise an error like "Masked arrays or arrays with nan entries are not supported".

Now, someone might ask why not just stick with bartMachine. The reason is that the data is also heavily overdispersed, so a Negative-Binomial likelihood makes much more sense than the normality assumption that comes with bartMachine. So, on paper, PyMC is the library that makes the most sense for this problem.

Given the situation, I would have the following questions:

  1. Does anyone have any advice or tricks on how to handle NA values with PyMC-BART? I have tried various off-the-shelf imputation methods from sklearn and they do not lead to satisfactory results.

  2. I’d be happy to work on incorporating missingness into the PyMC-BART library. Would this be a welcome feature? Would someone who has some experience with the code base be available to point me to the relevant parts of the library and provide me with some light supervision?
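For concreteness on point 1, the off-the-shelf imputation I tried looks roughly like this (a sketch with toy data; `add_indicator=True` appends binary missingness columns, similar in spirit to the `_isna` dummies in the model below):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])

# median imputation plus binary missingness-indicator columns
imp = SimpleImputer(strategy="median", add_indicator=True)
X_imp = imp.fit_transform(X)
# result: 2 imputed feature columns followed by 2 indicator columns
```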

For reference, this is the baseline model I’m trying to fit:

split_rules = [
    pmb.OneHotSplitRule
    if (c in ['geouid', 'toronto'] or c.endswith('_isna'))
    else pmb.ContinuousSplitRule
    for c in X_train.columns
]

sampling_dict = {
    "tune": 1000,
    "draws": 4000,
    "discard_tuned_samples": True,
    "compute_convergence_checks": True,
    "chains": 4,
    "cores": 4,
    "target_accept": 0.99,
    "random_seed": RANDOM_SEED,
    "nuts_sampler_kwargs": {
        "max_depth": 15,
    },
}

with pm.Model() as model_norm:
    X_data = pm.Data("X_data", X_train)
    Y_data = pm.Data("Y_data", Y_train)
    μ = pmb.BART("μ", X_data, Y_data, m=300, alpha=0.95, beta=2,
                 split_rules=split_rules)
    s = pm.HalfNormal("sigma", sigma=0.05)
    y = pm.Normal("y", mu=μ, sigma=s, observed=Y_data)
    trace_norm = pm.sample(**sampling_dict)

As you can see from the split_rules, I've introduced dummy NA columns to track missingness. This improved convergence a bit but didn't resolve the problem. I've tried a lot of other things, from different priors and hyperparameters to various imputation methods and hierarchical models, but I can't get this to really converge with PyMC, whereas it fits very smoothly with bartMachine.
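In case it's useful to anyone, the dummy NA columns are built along these lines (a sketch with toy data; the indicator naming and the median fill are just what I describe above, since BART input cannot contain NaNs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, 2.0, 2.0]})

# add a binary *_isna indicator for each column with missing values
na_cols = df.columns[df.isna().any()]
for c in na_cols:
    df[f"{c}_isna"] = df[c].isna().astype(int)

# fill the remaining NaNs so the BART input has no missing entries
df = df.fillna(df.median())
```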

CC @aloctavodia

Hi @maybemkl

Currently, PyMC-BART filters out missing values (see pymc-bart/pymc_bart/pgbart.py at 941d258e679e2c57b0571dc953e6bea657706fdb · pymc-devs/pymc-bart · GitHub). Better handling of missingness would be a nice feature; I'd be happy to guide you.
Have you tried running your model without OneHotSplitRule, using just the default splitting rule? In some tests I did in the past, I got better results with the defaults.