Need help with setup for BART model for binary classification

markben · November 14, 2022, 6:58pm

import arviz as az
import pymc as pm  # 0.4.3
import pymc_bart as pmb  # 0.2.1
import pandas as pd

# Ellipsis placeholders keep the snippet syntactically valid; substitute real data
df_train_features = ...  # select data here, about 850k samples
df_train_labels = ...  # select labels here, highly unbalanced with about 98% zeros

with pm.Model() as model_bart:
    # Latent sum-of-trees function with 200 trees
    mu = pmb.BART("mu", df_train_features, df_train_labels, m=200)
    # Inverse probit link maps the latent function to a probability
    theta = pm.Deterministic("theta", pm.math.invprobit(mu))
    y = pm.Bernoulli("y", p=theta, observed=df_train_labels)
    idata = pm.sample(random_seed=0, tune=200)

When I run the above code, the Jupyter notebook output shows the following progress:

Multiprocess sampling (4 chains in 4 jobs)
PGBART: [mu]
<progress bar> 100.00% [4800/4800 <time> Sampling 4 chains, 0 divergences]
Sampling 4 chains for 200 tune and 1_000 draw iterations (800 + 4_000 draws total) took <time> seconds.

So I have run into a few issues:

1. If I use the entire training data as described above, the kernel dies at some point during sampling without any warning or error. This has happened several times, with the progress bar at 50%, 80%, and even 100%. When I tried just 1% of the data (8,500 rows), it worked fine. How can I force verbose output to see the precise error message?
2. Is the above model setup correct for BART classification? I am basing it on chapter 4 of the original BART paper.
3. In an online setting, assuming the above code runs without error, how can I feed further training data into the model without retraining it from scratch? Which variable above should I pickle? And how do I resume training?
4. Relating to questions 1 and 3: if the crash is due to memory constraints, can I split the data into chunks of 10% and run the training 10 times sequentially?
5. I see many mentions of step methods and the NUTS sampler on this forum when searching on Google. Is this something I should concern myself with, and should I change the above code accordingly?
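
For questions 1 and 4, a back-of-the-envelope estimate may help gauge whether memory is a plausible cause of the silent kernel deaths. The sketch below rests on assumptions not confirmed in this thread: that every posterior draw stores one float64 value per training row for both "mu" and "theta", and that nothing else dominates memory use. The function name `trace_bytes` is hypothetical and only for illustration.

```python
# Rough lower bound on trace size for the model above.
# ASSUMPTION: each draw stores one float64 per training row for each of
# the two per-row variables ("mu" and "theta"); overhead is ignored.

N_CHAINS = 4          # from the sampler log: "4 chains in 4 jobs"
N_DRAWS = 1_000       # from the sampler log: "1_000 draw iterations"
N_ROW_VARS = 2        # "mu" and "theta" both have one entry per row
BYTES_PER_VALUE = 8   # float64


def trace_bytes(n_rows, n_chains=N_CHAINS, n_draws=N_DRAWS,
                n_vars=N_ROW_VARS, itemsize=BYTES_PER_VALUE):
    """Return the approximate number of bytes needed to hold the draws."""
    return n_rows * n_chains * n_draws * n_vars * itemsize


full_gb = trace_bytes(850_000) / 1e9   # full training set
subset_gb = trace_bytes(8_500) / 1e9   # the 1% subset that worked

print(f"full data: ~{full_gb:.1f} GB, 1% subset: ~{subset_gb:.3f} GB")
```

Under these assumptions the full data set needs roughly 54 GB just for the stored draws, against about half a gigabyte for the 1% subset, which would be consistent with the operating system killing the kernel without a Python error. Dropping the `theta` Deterministic would halve the estimate, since only one per-row variable would then be stored.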

My apologies if my questions are naive. I am completely new to both PyMC and Bayesian inference in general, so I am still learning.

Even if you have insight into just one of the questions, I’d really appreciate your input.
