Hello Justin,
Small warning up front: I have never worked with any Baseball data, and don’t even really know the basic rules, I hope that won’t show too much.
I see you have tried quite a few things, I’m not sure I can always follow completely however. Maybe it helps if I try to sketch how I might (perhaps naively) start to model a dataset like this, and you can clarify if that is what you want, or why it doesn’t do what you are after?
It sounds to me like a partially pooled regression might be a good fit, so something like
exit_velocity ~ 1 + type_of_hit + (1|batter)
or possibly
log(exit_velocity) ~ 1 + type_of_hit + (1|batter)
And I’d code that along those lines (not tested):
hit_type_idx, hit_types = pd.factorize(data["hit_type"])
batter_idx, batters = pd.factorize(data["batter"])
coords = {
"hit_type": hit_types,
"batter": batters,
"observation": data.index,
}
with pm.Model(coords=coords) as model:
intercept = pm.Normal("intercept", sigma=100)
hit_type_effect = pm.ZeroSumNormal("hit_type_effect", sigma=20, dims="hit_type")
# Maybe a centered parametrization might actually be better than this uncentered one,
# if there are a lot of batters in the dataset
raw = pm.Normal("batter_raw", dims="batter")
sd = pm.HalfNormal("batter_sigma", sigma=20)
batter_effect = pm.Deterministic("batter_effect", sd * raw, dims="batter")
# Centered parametrization would look like this
# sd = pm.HalfNormal("batter_sigma", sigma=20)
# batter_effect f pm.Normal("batter_effect", mu=0, sigma=sd, dims="batter")
mu = (
intercept
+ hit_type_effect[hit_type_idx]
+ batter_effect[batter_idx]
)
sigma = pm.HalfNormal("exit_velocity", sigma=20)
pm.Normal("y", mu=mu, sigma=sigma, observed=data["exit_velocity"], dims="observation")
So that by assuming that the batter effects are drawn from a normal distribution, we get some shrinkage.