How can I implement hierarchies for my Gamma BART model?

Hi all, I’m new to BART so please go easy on me!

My data has two crossed categorical variables, ‘Class’ and ‘Brand’. At present I’ve target-encoded both, replacing each category with the mean of the target variable for that category (one value per class, one per brand). I’d like to know if this is the best way to include categorical data, or if there’s another option.

The target variable is right skewed and positive (With a slightly different shape between ‘Class’ and ‘Brand’ categories), so I’ve used a Gamma distribution.

At the minute a lot of predictions move towards the global mean, and I don’t know if that is because of the data or how I’ve modelled it.

My model also takes quite a long time to run. Again, I don’t know if this is hardware, or how I’ve modelled.

I tried encoding the categories as integers and then using the split rules feature, but this inflated model training time and always yielded a broken chain. I’d be happy to put this back in if there’s a more stable workaround.

split_rules = [
    bart.SubsetSplitRule() if col in categorical_cols else bart.ContinuousSplitRule()
    for col in train_x.columns
]

Thank you!


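import numpy as np
import pymc as pm
import pymc_bart as bart

# train_x (predictors) and train_y (positive target) are defined elsewhere.
with pm.Model() as bart_model:
    X = pm.Data("X", train_x)
    # Response on the log scale, to match the exp link applied below.
    Y = pm.Data("Y", np.log(train_y))

    # Shape parameter of the Gamma likelihood.
    phi = pm.HalfNormal("phi", sigma=2)

    mu_bart = bart.BART(
        "mu_bart",
        X=X,
        Y=Y,
        m=500,
        alpha=0.95,
        beta=1.5,
        response="constant",
    )

    # Exponentiate so the mean is strictly positive.
    mu = pm.Deterministic("mu", pm.math.exp(mu_bart))

    # Mean parameterization: alpha / beta = phi / (phi / mu) = mu.
    y_obs = pm.Gamma(
        "y_obs",
        alpha=phi,
        beta=phi / mu,
        observed=train_y,
        shape=X.shape[0],
    )

    trace_bart = pm.sample(
        draws=2000,
        tune=1500,
        chains=4,
        cores=4,
        target_accept=0.95,
        random_seed=1994,
        progressbar=True,
        compute_convergence_checks=True,
    )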
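
For reference, the Gamma likelihood above uses a mean parameterization: with shape alpha = phi and rate beta = phi / mu, the mean is alpha / beta = mu and the variance is mu^2 / phi. A minimal standard-library check of that claim follows; the values of mu and phi are illustrative assumptions, not estimates from this data.

```python
import random
import statistics

# Illustrative values only (assumptions, not fitted to any data).
mu, phi = 3.0, 2.0

# pm.Gamma(alpha, beta) takes a rate beta, whereas random.gammavariate takes
# a shape and a *scale*, so scale = 1 / rate = mu / phi.
rng = random.Random(1994)
draws = [rng.gammavariate(phi, mu / phi) for _ in range(200_000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)

print(f"mean ~ {sample_mean:.3f} (theory {mu})")
print(f"var  ~ {sample_var:.3f} (theory {mu ** 2 / phi})")
```

A small phi therefore means a large variance around mu, which is worth keeping in mind when judging how far predictions sit from the observed values.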

Dropping compute_convergence_checks can help a little. I have also had a problem with the progress bar in Jupyter notebooks that I couldn’t quite figure out, but with progressbar=False sampling runs much faster. It’s likely a problem with my Jupyter environment.

You have also selected m=500, which is a very large number of trees. I have never seen an improvement with m > 200.

Usually when models take a long time to sample the problem is the way you write the model. I haven’t really done any models on categorical data so I can’t help you much there.

HIGHLY RECOMMEND TURNING OFF THE PROGRESS BAR WHEN YOU’RE IN A NOTEBOOK

Sorry for the all caps. I started a thread on this AND reached out to databricks probably a year ago? The problem still seems to be around :frowning:

Back to the model: it’s a bit hard to understand why you have chosen to target encode here. On the face of it, it’s hard to see why this would speed up your sampling compared with just encoding the category directly. What happens if your split rule doesn’t fall nicely at the class probability within a node, and nodes between trees are NOT independent?
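
To make the concern concrete, here is a small sketch of what mean target encoding does to a categorical column. The brand names and values are made up for illustration. Two brands with the same mean target receive the same encoded value, so no split on that column can tell them apart, even when their distributions differ in shape.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical toy data: (brand, target) pairs, invented for illustration.
rows = [
    ("A", 1.0), ("A", 5.0),    # mean 3.0, wide spread
    ("B", 3.0), ("B", 3.0),    # mean 3.0, no spread
    ("C", 10.0), ("C", 12.0),  # mean 11.0
]

# Group the targets by brand, then replace each brand with its mean target.
by_brand = defaultdict(list)
for brand, y in rows:
    by_brand[brand].append(y)
encoding = {brand: fmean(ys) for brand, ys in by_brand.items()}

# The encoded predictor column that the trees would actually see.
encoded_column = [encoding[brand] for brand, _ in rows]
print(encoding)
```

Brands A and B both map to 3.0 here, so the model sees them as one group. The encoded column is also computed from the target itself, which leaks information about the response into the predictors.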

For category and brand, could you impose a hierarchy amongst those and then use BART on the predictors within each level of interest? It’s admittedly a bit hard to follow what your goal is, because you have target encoded categories that seem to be natural choices for subgroups within a larger population, i.e. you would have a separate phi for each population, and beta by construction. Have I read this correctly?

On the note of integer coding: you are explicitly telling your BART model that there is some intrinsic ordering of categories, rather than just being distinct categories. That doesn’t sound like your goal, and it will throw you a wrench if it is (think about it: does splitting on category 3 mean anything? What if our categories are something like sales in a catalog, and category 3 is aquatics and 5 is electronics?).
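
One alternative that avoids implying an order is one-hot (indicator) encoding, where each category gets its own 0/1 column. Below is a minimal sketch using only the standard library. The category labels are hypothetical, and with many categories this adds many columns, which may itself slow sampling.

```python
# Hypothetical category labels, invented for illustration.
classes = ["aquatics", "electronics", "garden", "electronics", "aquatics"]

# Fix a column order so the encoding is reproducible.
levels = sorted(set(classes))

# Integer codes impose an order: a split such as `code <= 1` groups
# "aquatics" with "electronics" purely because of alphabetical position.
codes = [levels.index(c) for c in classes]

# One-hot rows: exactly one 1 per row, in the column for that row's level.
one_hot = [[1 if c == level else 0 for level in levels] for c in classes]

print(levels)
for row in one_hot:
    print(row)
```

With indicator columns, a split on any single column isolates exactly one category, so the trees never have to treat category 3 as "between" categories 2 and 4.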

Thanks, I’ve dropped the trees and found performance does level off at ~100. The progress bar has made a huge difference!