Seeking suggestion to improve low accuracy in BART classification

Cindyyy · January 18, 2024, 2:26am

Hello everyone,

I’ve been experimenting with BART for a classification task, and the accuracy of my predictions is much lower than I anticipated. The accuracy on the training dataset is about 0.8, but on the test dataset it is only about 0.388, whereas a KNN model I used previously reaches 0.77.

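import pandas as pd
import pymc as pm
import pymc_bart as pmb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df_train_cleaned = pd.read_csv("cleaned_pipeline.csv")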

train = df_train_cleaned

X = train.drop('Credit_Score', axis=1)[:1000]
y = train['Credit_Score'][:1000]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

categories = y_train.unique()
num_categories = len(categories)
num_obs = len(y_train)
with pm.Model() as bart_model:
    x = pm.MutableData("x", X_train)
    μ = pmb.BART("μ", X=x, Y=y_train, m=100, shape=(num_categories, num_obs))
    θ = pm.Deterministic('θ', pm.math.softmax(μ, axis=0))
    y_obs = pm.Categorical("y_obs", p=θ.T, observed=y_train)

    idata = pm.sample(draws=500, tune=10000, chains=8, cores=8, return_inferencedata=True)

y_train_pred = idata.posterior["θ"].mean(dim=["draw", "chain"]).argmax(dim="θ_dim_0")

accuracy = accuracy_score(y_train, y_train_pred)
print(f"accuracy: {accuracy}")

with bart_model:
    pm.set_data({"x": X_test})
    pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)
# the argmax is specific to this classification problem
yhat = idata.posterior["θ"].mean(dim=["draw", "chain"]).argmax(dim="θ_dim_0")
accuracy = accuracy_score(y_test, yhat)
print(f"accuracy: {accuracy}")
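
One thing worth checking in the evaluation step above: the test-set yhat is read from idata.posterior["θ"], which holds the in-sample draws produced by pm.sample. With predictions=True, pm.sample_posterior_predictive writes its output to a separate predictions group, so the posterior group is unchanged by that call. Scoring training-set predictions against y_test would be expected to give accuracy near chance level. The sketch below illustrates that effect with synthetic NumPy arrays only (no PyMC, not the credit dataset); the helper names fake_theta, predict, and accuracy are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_chains, n_draws, n_classes, n_obs = 2, 50, 3, 500


def fake_theta(labels):
    """Synthetic stand-in for draws of θ with shape (chain, draw, class, obs)."""
    logits = rng.normal(size=(n_chains, n_draws, n_classes, labels.size))
    # Raise the logit of the true class so the "model" is mostly right.
    logits[:, :, labels, np.arange(labels.size)] += 2.0
    exp = np.exp(logits - logits.max(axis=2, keepdims=True))
    return exp / exp.sum(axis=2, keepdims=True)


def predict(theta):
    """Average θ over chains and draws, then pick the most probable class."""
    return theta.mean(axis=(0, 1)).argmax(axis=0)


def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))


y_train = rng.integers(0, n_classes, size=n_obs)
y_test = rng.integers(0, n_classes, size=n_obs)

theta_train = fake_theta(y_train)  # plays the role of idata.posterior["θ"]
theta_test = fake_theta(y_test)    # plays the role of idata.predictions["θ"]

acc_train = accuracy(y_train, predict(theta_train))
acc_mismatch = accuracy(y_test, predict(theta_train))  # in-sample θ vs. test labels
acc_test = accuracy(y_test, predict(theta_test))

print(f"train: {acc_train:.3f}, mismatched: {acc_mismatch:.3f}, test: {acc_test:.3f}")
```

If this is what is happening, a likely fix is to pass var_names=["θ", "y_obs"] to pm.sample_posterior_predictive and compute yhat from idata.predictions["θ"] instead. That suggestion is untested against the data in this thread.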

One thing I’ve noticed is that there aren’t any examples in the documentation specifically about using BART for classification tasks, which makes me wonder if I’m missing something crucial in my approach. If anyone has experience with this or can point out any potential missteps in my code, that would be immensely helpful.

I’m aware that using a Categorical likelihood and a softmax function for a binary outcome might seem like overkill, but I plan to extend this project to a dataset with more than two categories, which is why I’ve set it up this way. Still, I’m puzzled by the low accuracy score in its current state.
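
For what it's worth, the softmax setup should not by itself hurt the binary case: with two categories, the softmax probability of the second class equals a logistic sigmoid applied to the difference of the two rows of μ. A small NumPy check of that identity, using random numbers in place of BART output:

```python
import numpy as np


def softmax(logits, axis=0):
    # Subtract the max for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
mu = rng.normal(size=(2, 5))          # (num_categories=2, num_obs=5)

theta = softmax(mu, axis=0)           # same axis as in the model above
p_class1 = sigmoid(mu[1] - mu[0])     # binary-logistic form

print(np.allclose(theta[1], p_class1))
```

The two-row version has one redundant degree of freedom (adding a constant to both rows leaves θ unchanged), which is a parameterization detail rather than a likely cause of low test accuracy.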

Another question I have is regarding the recovery of probabilities for the prediction dataset. I’m not entirely sure how to approach this with BART, so any advice or examples on this would be greatly appreciated.
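
On recovering probabilities: since θ is already a per-observation probability vector, one common approach is to summarize its draws directly, taking the mean over chains and draws as the point estimate and quantiles as an uncertainty band. The sketch below shows the array operations on synthetic Dirichlet draws standing in for θ; it assumes the out-of-sample θ has been stored (for example in idata.predictions["θ"]) with dimensions (chain, draw, category, observation), which is an assumption about the setup and not something confirmed in this thread.

```python
import numpy as np

rng = np.random.default_rng(1)
n_chains, n_draws, n_classes, n_obs = 4, 250, 3, 6

# Synthetic stand-in for the θ draws: Dirichlet samples sum to 1 per observation.
draws = rng.dirichlet(alpha=[4.0, 2.0, 1.0], size=(n_chains, n_draws, n_obs))
theta = np.moveaxis(draws, -1, 2)     # -> (chain, draw, class, obs)

# Pool chains and draws into a single sample axis: (sample, class, obs).
samples = theta.reshape(n_chains * n_draws, n_classes, n_obs)

p_mean = samples.mean(axis=0)                               # class probabilities
p_low, p_high = np.quantile(samples, [0.03, 0.97], axis=0)  # 94% band per class
y_pred = p_mean.argmax(axis=0)                              # hard class labels

print(p_mean.round(3))
print(y_pred)
```

With an xarray DataArray the same point estimate is .mean(dim=["chain", "draw"]), as already used for the training set above.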

Looking forward to any suggestions or insights you all might have. Thanks in advance for your help!

iavicenna · February 5, 2024, 3:37pm

I am not experienced with BART, but in case you haven’t seen it yet, there is a page with a couple of different BART models here:

PS: I just checked, and it seems the only examples are NegativeBinomial regressions, switch-point models, etc., so I’m afraid it does not help much with your question.

ricardoV94 · February 5, 2024, 11:37pm

CC @aloctavodia

aloctavodia · February 6, 2024, 2:09pm

Hi @Cindyyy, thanks for reporting. I will need a couple of days to check this in detail and provide an answer. Could you please provide a CSV file to reproduce your example?
