Modeling with count data as predictors and a continuous outcome variable in PyMC

Disclaimer 1: I posted this question on Stats StackExchange, but I feel it may be better suited here since it's very specific to Bayesian modeling in PyMC. Please let me know if it is not appropriate for this forum and I will take it down.

Disclaimer 2: This is a long explanation, but I feel it is needed to give a thorough description of my problem.

I am relatively new to Bayesian modeling and I am trying to wrap my head around a particular regression approach using count data. I understand how to model a GLM with a count variable as the dependent variable, e.g. using a Poisson or negative binomial likelihood. What isn't clear to me is how to implement this in reverse. I have been playing with baseball data because it is something I enjoy and it is the most intuitive for me. Basically, my problem is this: I want to model a player's exit velocity based on the number of hit outcomes for several types of hits. The reason I am doing it this way is that I am assuming each hit type is correlated with exit velocity, which is generally true but not perfect. This is by no means a good baseball question; it just sets up the problem the way I want to try to solve it. There are two issues:

  1. How do I model my priors so that I can have count distributions for the predictors with a likelihood on continuous data? I searched quite a bit and found several solutions (example) that basically say count data is fine to treat as continuous, but that is not what I want to do.
  2. I want to make inferences for each batter. Some have many observations (total hit counts) while others have few. How can I add a hierarchical structure to infer the distribution for each batter, whether they have few or many observations?

I will outline what I have done so far. For reference, I am using Python with statsmodels and PyMC.

My original pandas dataframe looks like this:

batter  type_of_hit  exit_velocity
1       fly_ball     82.6
1       ground_ball  92.1
10      ground_ball  78.9
10      line_drive   94.6
N       hit_N        velocity_N

I can model this by converting the hit types to categorical variables and regressing exit_velocity on hit type and batter:

exit_velocity ~ C(batter)*C(type_of_hit)
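In statsmodels that is a one-liner, e.g. (a sketch; `data` is the long-format dataframe above):

import statsmodels.formula.api as smf

# Fully interacted OLS: one coefficient per batter/hit-type combination
ols_fit = smf.ols("exit_velocity ~ C(batter) * C(type_of_hit)", data=data).fit()
print(ols_fit.summary())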

With so many batters and a varying number of hits per batter, though, this is a messy approach, especially since the total number of hits per batter ranges from roughly 5 to 300.

Another approach is to group by batter, count the number of hits per type, and compute \mu_{\text{exit\_velocity}} \pm \sigma for each batter:

batter  fly_ball  ground_ball  line_drive  mean_exit_velocity  std_exit_velocity
1       2         13           2           3                   81.072924
2       3         5            0           1                   92.482074
3       28        70           38          9                   89.839477
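For reference, one way to build that aggregated frame (a sketch; it assumes the long-format columns shown earlier):

import pandas as pd

# Per-batter hit-type counts plus exit-velocity summary statistics
counts_wide = pd.crosstab(data["batter"], data["type_of_hit"])
summary = data.groupby("batter")["exit_velocity"].agg(
    mean_exit_velocity="mean", std_exit_velocity="std"
)
agg = counts_wide.join(summary).reset_index()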

Again, I can simply perform a linear regression:

exit_velocity ~ ground_ball + line_drive + fly_ball

The issue is the same: the total counts are distributed very unevenly across batters. To remedy this, I experimented with regressing each player to the mean using two approaches. The first is the way Tom Tango does it for wOBA in "The Book":

\mu_{\text{new}} = \frac{\mu_{\text{total}} / \sigma^2_{\text{total}} + \mu_{\text{batter}} / \sigma^2_{\text{batter}}}{1 / \sigma^2_{\text{total}} + 1 / \sigma^2_{\text{batter}}}
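In code, this precision weighting is just (a quick sketch; the variable names are mine):

def regress_to_mean(mu_batter, var_batter, mu_total, var_total):
    # Each mean is weighted by its inverse variance, so noisy
    # batter estimates shrink harder toward the league mean
    w_batter = 1.0 / var_batter
    w_total = 1.0 / var_total
    return (mu_total * w_total + mu_batter * w_batter) / (w_total + w_batter)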

The second way I did it was to use an Empirical Bayes approach from David Robinson's book (the blog posts that became the book; the method is in Ch. 10 of the book, which is not covered in the posts). Basically, I modeled the counts as a Dirichlet-Multinomial. After obtaining the \alpha values, I simply added them to each player's counts to get an estimate of the total number of hits per player that is biased toward the league prior.

import numpy as np
import pymc as pm
import arviz as az

# Get the counts as a numpy array
hit_types = ["ground_ball", "line_drive", "fly_ball"]
counts = data[hit_types].to_numpy()

# Reparameterization of the Dirichlet-Multinomial
with pm.Model() as model:
    frac = pm.Dirichlet("frac", a=np.ones(len(hit_types)))
    conc = pm.Lognormal("conc", mu=1, sigma=1)
    # Keep the numpy `counts` array distinct from the model variable
    pm.DirichletMultinomial(
        "counts", a=frac * conc, n=counts.sum(axis=1), shape=counts.shape, observed=counts
    )
    trace = pm.sample(2000, chains=4, return_inferencedata=True)

az.plot_trace(data=trace, var_names=["frac", "conc"])

# League-level pseudo-counts: posterior mean of frac * conc
alpha = (
    trace.posterior["frac"].mean(("chain", "draw")).to_numpy()
    * trace.posterior["conc"].mean(("chain", "draw")).to_numpy()
)
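Applying the shrinkage then amounts to adding these pseudo-counts to each player's raw counts and renormalizing:

# Shrunken hit-type proportions: raw counts plus league-level pseudo-counts
shrunk_counts = counts + alpha
shrunk_props = shrunk_counts / shrunk_counts.sum(axis=1, keepdims=True)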

It is an interesting approach and does kind of work, but it has a large shrinkage effect overall. In reality, the mean exit velocity across players is not that widely distributed, although there are a few notable exceptions (e.g. Stanton). However, this is where I feel a full Bayesian approach will shine: I can estimate players with low and high numbers of observations while capturing the uncertainty in each player's estimate. What I can't figure out is how to model the priors.

I was thinking something like this:

with pm.Model() as model:

    # Priors:
    # SEE BELOW

    # Mean and std of the likelihood
    sigma_L = pm.HalfNormal("sigma_L", sigma=10)
    mu_L = pm.Deterministic("mu_L", b0 + b1*h1 + b2*h2 + b3*h3)

    # Likelihood
    obs = pm.Normal("obs", mu=mu_L, sigma=sigma_L, observed=data["exit_velocity"].to_numpy())

I was thinking I could model the priors as count distributions. The variance is much greater than the mean and there are lots of zeros, so probably a zero-inflated negative binomial:

pm.ZeroInflatedNegativeBinomial(name, psi, mu=None, alpha=None, p=None, n=None)

If I consider only players with a sufficient number of observations (say > 100), the zeros go away and it's just a negative binomial (NB):

pm.NegativeBinomial(name, mu=None, alpha=None, p=None, n=None)
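For reference, in this (\mu, \alpha) parameterization the mean is \mu and the variance is \mu + \mu^2 / \alpha, so a smaller \alpha means more overdispersion.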

For this, I need an expected proportion (psi; only for the zero-inflated version), a Poisson parameter (mu), and an alpha for each hit type. It seems like I need to model them jointly, so do I need something like the Dirichlet-Multinomial for the expected proportions? I am also having a hard time figuring out whether I should model the data (h1, h2, h3) or the weights (b0..b3) as NB.

I would like to add a hierarchical structure over players so that, given the number of counts per hit type, I can estimate an average exit velocity for players with many or few observations.

Like I said, I am pretty new to the Bayesian framework so I appreciate your help with this issue.

Hello Justin,

Small warning up front: I have never worked with any baseball data, and I don't even really know the basic rules. I hope that won't show too much. :slight_smile:

I see you have tried quite a few things, though I'm not sure I can follow all of them completely. Maybe it helps if I sketch how I might (perhaps naively) start modeling a dataset like this, and you can clarify whether that is what you want, or why it doesn't do what you are after.

It sounds to me like a partially pooled regression might be a good fit, so something like

exit_velocity ~ 1 + type_of_hit + (1|batter)

or possibly

log(exit_velocity) ~ 1 + type_of_hit + (1|batter)
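As an aside, the bambi library accepts exactly this lme4-style formula syntax and builds the PyMC model for you (a sketch, assuming your columns match the formula):

import bambi as bmb

# Sketch: bambi translates the formula into a partially pooled PyMC model
formula_model = bmb.Model("exit_velocity ~ 1 + type_of_hit + (1|batter)", data)
formula_idata = formula_model.fit()

But writing the model out by hand makes the structure easier to discuss.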

If you'd rather write the model out in PyMC yourself, I'd code it along these lines (not tested):

hit_type_idx, hit_types = pd.factorize(data["type_of_hit"])
batter_idx, batters = pd.factorize(data["batter"])
coords = {
    "hit_type": hit_types,
    "batter": batters,
    "observation": data.index,
}

with pm.Model(coords=coords) as model:
    intercept = pm.Normal("intercept", sigma=100)

    hit_type_effect = pm.ZeroSumNormal("hit_type_effect", sigma=20, dims="hit_type")

    # Maybe a centered parametrization might actually be better than this uncentered one,
    # if there are a lot of batters in the dataset
    raw = pm.Normal("batter_raw", dims="batter")
    sd = pm.HalfNormal("batter_sigma", sigma=20)
    batter_effect = pm.Deterministic("batter_effect", sd * raw, dims="batter")

    # Centered parametrization would look like this
    # sd = pm.HalfNormal("batter_sigma", sigma=20)
    # batter_effect = pm.Normal("batter_effect", mu=0, sigma=sd, dims="batter")

    mu = (
        intercept
        + hit_type_effect[hit_type_idx]
        + batter_effect[batter_idx]
    )
    sigma = pm.HalfNormal("exit_velocity", sigma=20)
    pm.Normal("y", mu=mu, sigma=sigma, observed=data["exit_velocity"], dims="observation")

By assuming that the batter effects are drawn from a common normal distribution, we get some shrinkage of the individual batter estimates toward the overall mean.
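To fit it, you would then sample as usual and look at the group-level parameters, e.g. (also untested):

with model:
    idata = pm.sample(chains=4)

az.summary(idata, var_names=["intercept", "hit_type_effect", "batter_sigma"])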

EDIT: This is an edited post since I was able to answer some of my own questions.

Thanks so much for your reply! I know the background I provided was a bit long and maybe confusing, but I believe you understood exactly what I was asking.

Yes, I had a partially pooled regression in mind, and I believe you captured it exactly. I was able to run the code and sample a small number of draws for now (after updating to v4.3). I have ~800 unique batters in the dataset, so I decided to use the centered parameterization as you suggested. I was trying to perform prior predictive checks and wanted to make sure I correctly understand how to reconstruct the equation.

hit_type_idx, hit_types = pd.factorize(data_nonan["hittype"])
batter_idx, batters = pd.factorize(data_nonan["batter"])
coords = {
    "hit_type": hit_types,
    "batter": batters,
    "observation": data_nonan.index,
}

# Mask nan values
# exit_velo = np.ma.masked_array(data["speed_A"].values,mask=data["speed_A"].isna().values)

with pm.Model(coords=coords) as model:
    intercept = pm.Normal("intercept", sigma=100)
    hit_type_effect = pm.ZeroSumNormal("hit_type_effect", sigma=20, dims="hit_type")

    # Centered parametrization
    sd = pm.HalfNormal("batter_sigma", sigma=20)
    batter_effect = pm.Normal("batter_effect", mu=0, sigma=sd, dims="batter")
    
    # Observed data
    mu = intercept + hit_type_effect[hit_type_idx] + batter_effect[batter_idx]
    sigma = pm.HalfNormal("sigma", sigma=20)
    pm.Normal("y", mu=mu, sigma=sigma, observed=data_nonan["speed_A"].values, dims="observation")

    idata = pm.sample_prior_predictive(samples=50)

prior = idata["prior"]

y = prior["intercept"] + prior["batter_effect"] + prior["hit_type_effect"]*[1,2,3,4]

Am I correct in my understanding that the intercept is offset by the batter effect, which results in a batter-specific intercept? From there, the batter intercept is offset by the amount for each hit type. So the y-axis would be the continuous measurement, and the x-axis would have four values (e.g., 1, 2, 3, 4), one for each hit category? Something similar to the radon study, except instead of two levels (floor and basement), there would be four levels, one for each hit type?