I’m trying to figure out how to use the recently added BART model for a binary classification task, using the Breast Cancer Wisconsin dataset.
I use a train/test split to evaluate the predictive performance of the fitted model on unseen data, so I defined X as a shared variable:
>>> X_shared = theano.shared(X_train)
The shapes of the datasets are as follows:
>>> X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
((426, 30), (426,), (143, 30), (143,))
with pm.Model() as model:
    x = pm.BART('x', X_shared.get_value(), Y_train)
    y = pm.Bernoulli('y', p=pm.math.sigmoid(x), observed=Y_train)
    trace = pm.sample()
After plugging in the test data, I noticed that the shape of the posterior wasn’t updated (it still has the same number of samples as the training data):
>>> with model:
...     ppc = pm.sample_posterior_predictive(trace)
>>> posterior = ppc.get('y')
>>> posterior.shape
(2000, 426)
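For context, here is a minimal sketch of how an accuracy figure can be derived from an array laid out like the one above, with shape (draws, observations). It assumes the draws are 0/1 Bernoulli samples that are averaged per observation and thresholded at 0.5. The helper name `accuracy_from_ppc` is hypothetical, and the data is a synthetic stand-in, not the actual posterior from this model.

```python
import numpy as np

def accuracy_from_ppc(ppc_samples, y_true, threshold=0.5):
    """Hypothetical helper: score posterior predictive draws against labels."""
    # Average the 0/1 draws over the draw axis to estimate P(y = 1)
    # for each observation.
    p_hat = ppc_samples.mean(axis=0)
    # Turn the estimated probabilities into hard class predictions.
    y_pred = (p_hat >= threshold).astype(int)
    return float((y_pred == y_true).mean())

# Synthetic stand-in with the same layout as ppc['y']: (draws, observations).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=426)
p_true = np.where(y_true == 1, 0.9, 0.1)
ppc_samples = rng.binomial(1, p_true, size=(2000, 426))

acc = accuracy_from_ppc(ppc_samples, y_true)
print(ppc_samples.shape, acc)
```

Note that the second axis has to match the length of the labels being scored, so an array with 426 columns can only be compared against the 426 training labels, not the 143 test labels.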
The questions I’m struggling with are:
- Do I use the BART model the correct way for classification? How could I improve my model? In comparison to using a RandomForestClassifier with default hyperparameters (accuracy 96%), the results of this model seem to be no better than random guessing (accuracy 57%).
- Why wasn’t the shape of the posterior updated to
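For the random forest comparison mentioned in the first question, a baseline along these lines is one possible sketch. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) and the default 75/25 split of `train_test_split`, which yields the same shapes as listed earlier. The `random_state` values are placeholders and not necessarily what the Gist uses, so the exact accuracy may differ from the figure quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the features and binary labels as NumPy arrays.
X, Y = load_breast_cancer(return_X_y=True)

# Default test_size (0.25) gives 426 training and 143 test rows.
# random_state=0 is an assumption made for reproducibility.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Random forest with default hyperparameters (seed fixed as an assumption).
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, Y_train)

# Mean accuracy on the held-out test set.
rf_accuracy = clf.score(X_test, Y_test)
print(X_train.shape, X_test.shape, rf_accuracy)
```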
In order to reproduce my case, here’s the complete Gist.
Many thanks in advance!