Modeling with count data as predictors and continuous as outcome variable in pymc

Hello Justin,

Small warning up front: I have never worked with any Baseball data, and don’t even really know the basic rules, I hope that won’t show too much. :slight_smile:

I see you have tried quite a few things, I’m not sure I can always follow completely however. Maybe it helps if I try to sketch how I might (perhaps naively) start to model a dataset like this, and you can clarify if that is what you want, or why it doesn’t do what you are after?

It sounds to me like a partially pooled regression might be a good fit, so something like

exit_velocity ~ 1 + type_of_hit + (1|batter)

or possibly

log(exit_velocity) ~ 1 + type_of_hit + (1|batter)

And I’d code that along those lines (not tested):

hit_type_idx, hit_types  = pd.factorize(data["hit_type"])
batter_idx, batters  = pd.factorize(data["batter"])
coords = {
    "hit_type": hit_types,
    "batter": batters,
    "observation": data.index,
}

with pm.Model(coords=coords) as model:
    intercept = pm.Normal("intercept", sigma=100)

    hit_type_effect = pm.ZeroSumNormal("hit_type_effect", sigma=20, dims="hit_type")

    # Maybe a centered parametrization might actually be better than this uncentered one,
    # if there are a lot of batters in the dataset
    raw = pm.Normal("batter_raw", dims="batter")
    sd = pm.HalfNormal("batter_sigma", sigma=20)
    batter_effect = pm.Deterministic("batter_effect", sd * raw, dims="batter")

    # Centered parametrization would look like this
    # sd = pm.HalfNormal("batter_sigma", sigma=20)
    # batter_effect f pm.Normal("batter_effect", mu=0, sigma=sd, dims="batter")

    mu = (
        intercept
        + hit_type_effect[hit_type_idx]
        + batter_effect[batter_idx]
    )
    sigma = pm.HalfNormal("exit_velocity", sigma=20)
    pm.Normal("y", mu=mu, sigma=sigma, observed=data["exit_velocity"], dims="observation")

So that by assuming that the batter effects are drawn from a normal distribution, we get some shrinkage.