Hi all, I’m new to BART so please go easy on me!
My data has two crossed categorical variables, ‘Class’ and ‘Brand’. At present I’ve encoded each of them as the mean of the target variable within each category (one value per class, one value per brand). I’d like to know whether this is the best way to include categorical data in BART, or whether there’s a better option.
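For illustration, the mean (target) encoding described above can be sketched as follows. This is a minimal example assuming pandas; the toy values and the `target` column name are made up and are not the real data set.

```python
import pandas as pd

# Toy data: values and the "target" column name are illustrative assumptions.
df = pd.DataFrame({
    "Class": ["a", "a", "b", "b", "b"],
    "Brand": ["x", "y", "x", "y", "y"],
    "target": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Replace each category with the mean of the target within that category,
# giving one encoded column per categorical variable.
for col in ["Class", "Brand"]:
    df[f"{col}_enc"] = df.groupby(col)["target"].transform("mean")
```

In a real workflow the category means would normally be computed on the training split only and then mapped onto the test split, so that the target does not leak into the features.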
The target variable is right-skewed and positive (with a slightly different shape across the ‘Class’ and ‘Brand’ categories), so I’ve used a Gamma likelihood.
At the moment a lot of predictions are pulled towards the global mean, and I don’t know whether that is because of the data or because of how I’ve modelled it.
My model also takes quite a long time to run. Again, I don’t know whether this is down to my hardware or to how I’ve set up the model.
I tried encoding the categories as integers and then using the split rules feature (snippet below), but this inflated model training time and always yielded a broken chain. I’d be happy to put this back in if there’s a more stable workaround.
split_rules = [
    bart.SubsetSplitRule() if col in categorical_cols
    else bart.ContinuousSplitRule()
    for col in train_x.columns
]
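For context, the integer encoding that the split rules rely on could look like the sketch below. It assumes pandas; the toy `train_x`, the `price` column and the category labels are hypothetical. The boolean mask mirrors the condition used in the list comprehension above without needing the BART library.

```python
import pandas as pd

# Toy feature table: all values and the "price" column are illustrative.
train_x = pd.DataFrame({
    "Class": ["a", "b", "a", "c"],
    "Brand": ["x", "x", "y", "z"],
    "price": [1.5, 2.0, 0.7, 3.1],
})
categorical_cols = ["Class", "Brand"]

# Map each category label to an integer code (labels are sorted first).
for col in categorical_cols:
    train_x[col] = train_x[col].astype("category").cat.codes

# One flag per column, in column order: True means a subset-style split rule
# would be chosen, False means a continuous one.
is_categorical = [col in categorical_cols for col in train_x.columns]
```

The codes carry no ordering information, which is why a subset-style rule rather than a threshold rule is the natural choice for those columns.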
Thank you!
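import numpy as np
import pymc as pm
import pymc_bart as bart  # assumed alias, matching the `bart.` prefix used here

with pm.Model() as bart_model:
    X = pm.Data("X", train_x)
    Y = pm.Data("Y", np.log(train_y))  # BART is given the log of the target

    # Shape parameter of the Gamma likelihood
    phi = pm.HalfNormal("phi", sigma=2)

    mu_bart = bart.BART(
        "mu_bart",
        X=X,
        Y=Y,
        m=500,
        alpha=0.95,
        beta=1.5,
        response="constant",
    )
    # Log link: exponentiate so the Gamma mean is positive
    mu = pm.Deterministic("mu", pm.math.exp(mu_bart))

    # Gamma with shape phi and rate phi / mu, so its mean is mu
    y_obs = pm.Gamma(
        "y_obs",
        alpha=phi,
        beta=phi / mu,
        observed=train_y,
        shape=X.shape[0],
    )

    trace_bart = pm.sample(
        draws=2000,
        tune=1500,
        chains=4,
        cores=4,
        target_accept=0.95,
        random_seed=1994,
        progressbar=True,
        compute_convergence_checks=True,
    )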
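
As a sanity check on the likelihood, the parameterisation `alpha=phi`, `beta=phi / mu` gives a Gamma with mean `alpha / beta = mu` and variance `alpha / beta**2 = mu**2 / phi`. The sketch below verifies this numerically with NumPy only; the values of `mu` and `phi` are arbitrary illustrations, not fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, phi = 3.0, 2.0            # illustrative values, not estimates
alpha, beta = phi, phi / mu   # shape and rate, as in the model

# NumPy parameterises the Gamma by shape and scale, where scale = 1 / rate.
draws = rng.gamma(shape=alpha, scale=1.0 / beta, size=200_000)

mean_theory = alpha / beta       # equals mu
var_theory = alpha / beta**2     # equals mu**2 / phi

print(draws.mean(), mean_theory)
print(draws.var(), var_theory)
```

Under this parameterisation a larger `phi` means less spread around `mu`, so the `HalfNormal` prior on `phi` directly controls how dispersed the observations are allowed to be.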