Multi-class BART Model Assistance

Hello,

I’ve been looking into various use-cases for BART models and in particular have read through the Ch7 example in Bayesian Modeling and Computation in Python as well as the similar example notebook. I also looked at this binary classification example.

I was having trouble getting a binary classification model to work because I was unable to recover the labels after pmx.bart.predict for an out-of-sample (hold-out) data set.

My data consists of five (5) variables and a binary response variable.
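For reference, a synthetic stand-in with the same structure (the shapes and values below are made up purely for illustration, not my real data; rng is reused further down by pmx.bart.predict):

import numpy as np

rng = np.random.default_rng(42)
# made-up data: five continuous predictors, one binary response
X = rng.normal(size=(1_000, 5))
y_vals = rng.integers(0, 2, size=1_000)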

import pymc as pm
import pymc_experimental as pmx
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_vals, test_size=0.25, random_state=42, stratify=y_vals)

with pm.Model() as bart_model:
    μ = pmx.BART("μ", X=X_train, Y=y_train, m=50)
    θ = pm.Deterministic('θ', pm.math.sigmoid(μ))
    y = pm.Bernoulli("y", p=θ, observed=y_train)

with bart_model:
    trace = pm.sample()
    trace.extend(pm.sample_posterior_predictive(trace))
    predtest = pmx.bart.predict(trace, rng, X=X_test)

After I use pmx.bart.predict on the out-of-sample data, the recovered values look like:

array([-6.93663277, -8.30200875, -7.18379557, ..., -5.97973494,
       -6.49198275, -7.58192513])

However, after modifying the pmx.bart.predict source code to return sigmoid(pred), the result is more like what I would expect:

array([0.00398453, 0.0002981 , 0.00223829, ..., 0.00137673, 0.00213677,
       0.00066914])
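With the output on the probability scale (call the array above probs), labels can then be recovered by thresholding at 0.5 or by drawing Bernoulli samples, e.g.:

hard_labels = (probs >= 0.5).astype(int)     # deterministic threshold at 0.5
sampled_labels = rng.binomial(n=1, p=probs)  # one stochastic draw per observation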

Moving on to my actual application, I again have a dataset with five (5) variables, but now a response variable with six classes, array([0, 1, 2, 3, 4, 5]). So I tried to extend the above to a multi-class problem using a softmax function.

with pm.Model() as bart_model:
    μ = pmx.BART("μ", X=X_train, Y=y_train, m=50)
    θ = pm.math.softmax(μ)
    y = pm.Categorical("y", p=θ, observed=y_train)


with bart_model:
    trace = pm.sample()
    trace.extend(pm.sample_posterior_predictive(trace))

Unfortunately, when I check the posterior predictive, it doesn’t return the ‘y’ class values. Instead, it returns integers far outside the 0–5 class range, presumably because taking the softmax of the flat, length-7500 μ vector produces a single probability vector over 7500 entries, so the likelihood becomes one 7500-way categorical rather than per-observation class probabilities:

trace.posterior_predictive.y.values
array([[[3841, 2703, 4008, ..., 2289,  426, 3936],
        [6172, 1810, 7215, ..., 1460, 3219, 4301],
        [3542, 3660, 2105, ..., 5984,  230, 5654],
        ...,
        [7081, 2198, 5241, ..., 5605, 7490, 4982],
        [7121, 6629, 4655, ..., 2991, 7219, 1199],
        [ 975, 4058, 3640, ..., 5188, 6736, 6479]],
        ...,
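A quick check of the range of the sampled values should confirm whether that is what is happening:

vals = trace.posterior_predictive.y.values
# if the diagnosis above is right, these span roughly 0..7499 rather than 0..5
print(vals.min(), vals.max())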

And the subsequent prediction on hold-out data:

with bart_model:
    predtest = pmx.bart.predict(trace, rng, X=X_test)

Returns an array with results that do not appear to have any class structure.

array([[[-58.23271276],
        [-58.79261905],
        [-56.67507199],
        ...,
        [-57.04217796],
        [-62.01896868],
        [-56.81891283]]])

Any help here would be greatly appreciated! Thanks in advance.

CC @aloctavodia

We have recently added a shape/size argument to BART. You can do something like:

    μ = pmx.BART("μ", X=X_train, Y=y_train, m=50, shape=(num_categories, num_observations))

Thank you - I tried the below, but it did not work.

with pm.Model(coords=COORDS) as bart_model:
    μ = pmx.BART("μ", X=X_train, Y=y_train, m=50, shape=(6, 7500))
    #θ = pm.Deterministic('θ', pm.math.softmax(μ))
    θ = pm.math.softmax(μ)
    y = pm.Categorical("y", p=θ, observed=y_train)

This threw a really long error, summarized by:

AssertionError: Could not broadcast dimensions

Are you sure you have the latest version of pymc-experimental? I tried with a silly example and it works; in my example y_train has shape (100, 2) and the BART variable has shape (2, 100). I will try a more realistic example, it will be a good exercise to see how BART works with categorical distributions.

I’m still fairly new to this, so I’m sure it’s a shape issue on my side (I installed pymc_experimental this weekend). My y_train has shape (7500,) but contains six different integer categories. I think I need to encode y_train.

The following example seems to work.

import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(0, 1, 120)[:, None]
Y = np.repeat([0, 1, 2], 40)
plt.plot(X[:, 0], Y, ".")

[Figure "cat": scatter plot of Y (classes 0–2) against X]

with pm.Model() as bart_model:
    μ = pmx.BART("μ", X=X, Y=Y, m=50, shape=(3, 120))      # one row of μ per class
    θ = pm.Deterministic('θ', pm.math.softmax(μ, axis=0))  # normalize across the class axis
    y = pm.Categorical("y", p=θ.T, observed=Y)             # classes on the last axis: p has shape (120, 3)
    idata = pm.sample()

Now we check the results:

import arviz as az

# get the posterior mean of θ
posterior_mean = az.extract_dataset(idata, var_names='θ').mean("sample")
# using the posterior mean, generate posterior-predictive samples (synthetic data)
new_y = [np.random.multinomial(n=1,
         pvals=posterior_mean.sel({"θ_dim_1": i}).values).argmax() for i in range(120)]

plt.plot(X[:,0], new_y, ".")

[Figure "cat_post": scatter plot of the synthetic new_y against X]


Thank you, it is working now with a slight modification: I had to select the θ variable explicitly, as in posterior_mean.sel({"θ_dim_1": i}).θ.values.

posterior_mean = az.extract_dataset(trace, var_names='θ').mean("sample")
new_y = [np.random.multinomial(n=1, pvals=posterior_mean.sel({"θ_dim_1":i}).θ.values).argmax() for i in range(7500)]
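(If hard class assignments are all that is needed, a vectorized alternative to sampling, using the same posterior_mean as above, would be:)

# most probable class per observation; θ here has shape (6, 7500)
hard_y = posterior_mean.θ.values.argmax(axis=0)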

I am also interested in predicting out-of-sample data, but pmx.bart.predict doesn’t seem to be doing a good job with unbalanced classes.

Right, the difference is that I am using the development version of ArviZ. Do you have an example that you could share, even a made-up one? I have not tested BART with categorical likelihoods; it may need some internal tweaking, like what we currently do for binomial data.

Note: also take into account that pmx.bart.predict makes predictions for the values of μ, so if you want to obtain observations you first need to apply the softmax function and then sample from the categorical distribution, something like:

# predict μ for the new data, map it to class probabilities, then sample class labels
pvals = pm.math.softmax(pmx.bart.predict(idata, rng, X=X_new)[0], axis=1).eval()
new_y = [np.random.multinomial(n=1, pvals=pvals[i]).argmax() for i in range(120)]
plt.hist(new_y, alpha=0.5)
plt.hist(Y, alpha=0.5)

[Figure "cat_pred": overlaid histograms of the predicted new_y and the observed Y]
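To quantify the comparison in those histograms, especially with unbalanced classes, one can also compare the predicted and observed class frequencies directly, for example:

# compare predicted vs. observed class frequencies (three classes as above)
pred_freq = np.bincount(new_y, minlength=3) / len(new_y)
obs_freq = np.bincount(Y, minlength=3) / len(Y)
print("predicted:", pred_freq)
print("observed: ", obs_freq)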