Disclaimer 1: I posted this question on Stats StackExchange, but I feel it may be better suited here since it's very specific to Bayesian modeling in pymc. Please let me know if it is not appropriate for this forum and I will take it down.
Disclaimer 2: This is a long explanation, but I feel it is needed to give a thorough description of my problem.
I am relatively new to Bayesian modeling and I am trying to wrap my head around a particular regression approach using count data. I understand how to model a GLM with a count variable as the dependent variable, e.g. a Poisson or negative binomial regression. What isn't clear to me is how to implement this in the reverse direction. I have been playing with baseball data because it is something I enjoy and it is the most intuitive for me. Basically, my problem is this: I want to model a player's exit velocity based on the number of hit outcomes for several types of hits. The reason I am doing it this way is that I am assuming each hit type correlates with exit velocity, which is generally true but not perfect. This is by no means a good baseball question; it just sets up the problem in the way I want to try to solve it. There are two issues:
- How do I model my priors so that I can have count distributions for the predictors with a likelihood on continuous data? I searched quite a bit and did find several solutions (example) that basically say count data is OK to treat as continuous, but that is not what I want to do.
- I want to make inferences for each batter. Some have many observations (total hit counts) while others have few. How can I add a hierarchical structure to infer a distribution for each batter, whether they have few or many observations?
I will outline what I have done so far. By the way, I am using Python with statsmodels and pymc for reference.
My original pandas dataframe looks like this:
batter | type_of_hit | exit_velocity |
---|---|---|
1 | fly_ball | 82.6 |
1 | ground_ball | 92.1 |
… | … | … |
10 | ground_ball | 78.9 |
10 | line_drive | 94.6 |
… | … | … |
N | hit_N | velocity_N |
I can model this by converting the hits to categorical and just regress exit_velocity on hit type and batter.
exit_velocity ~ C(batter)*C(type_of_hit)
With the data having so many batters and varying number of hits per batter, this is a messy approach, especially since the number of total hits can be from ~5-300 per batter.
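For concreteness, this is roughly how I fit that formula, with a tiny made-up dataframe standing in for my real data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for the real dataframe (same columns as the table above)
data = pd.DataFrame({
    "batter": [1, 1, 1, 2, 2, 2, 3, 3, 3] * 2,
    "type_of_hit": ["fly_ball", "ground_ball", "line_drive"] * 6,
    "exit_velocity": [82.6, 92.1, 95.0, 78.9, 88.2, 94.6, 90.1, 85.5, 97.3,
                      84.0, 90.5, 93.2, 80.1, 87.0, 95.8, 91.4, 86.2, 96.0],
})

# One coefficient per batter/hit-type combination: with many batters and
# very uneven hit counts per batter, this gets messy fast.
fit = smf.ols("exit_velocity ~ C(batter) * C(type_of_hit)", data=data).fit()
print(fit.params)
```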
Another approach is to group by batter, count the number of hits per type, and compute the mean exit velocity \mu \pm \sigma for each batter.
batter | fly_ball | ground_ball | line_drive | std_exit_velocity | mean_exit_velocity |
---|---|---|---|---|---|
1 | 2 | 13 | 2 | 3 | 81.072924 |
2 | 3 | 5 | 0 | 1 | 92.482074 |
3 | 28 | 70 | 38 | 9 | 89.839477 |
… | … | … | … | … | … |
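The aggregation itself is straightforward in pandas (again with a toy dataframe in place of mine):

```python
import pandas as pd

# Toy long-format dataframe, same columns as my real data
data = pd.DataFrame({
    "batter": [1, 1, 1, 1, 2, 2],
    "type_of_hit": ["fly_ball", "ground_ball", "ground_ball", "line_drive",
                    "ground_ball", "fly_ball"],
    "exit_velocity": [82.6, 92.1, 88.4, 94.6, 78.9, 85.0],
})

# Count hits per type (wide format) and summarize exit velocity per batter
counts = pd.crosstab(data["batter"], data["type_of_hit"])
summary = data.groupby("batter")["exit_velocity"].agg(["mean", "std"])
wide = counts.join(summary)
print(wide)
```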
Again, I can simply perform a linear regression:
exit_velocity ~ ground_ball + line_drive + fly_ball
The issue is the same: the total counts are very widely distributed across batters. To remedy this, I played with regressing each player to the mean using two approaches. The first is the way Tom Tango does it for wOBA in “The Book”:
\mu_{\text{new}} = \frac{\mu_{total} / \sigma^2_{total} + \mu_{batter} / \sigma^2_{batter}}{1 / \sigma^2_{total} + 1 / \sigma^2_{batter}}
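This precision-weighted average is easy to compute directly; all the numbers below are made up for illustration:

```python
# Precision-weighted shrinkage toward the league mean, per the formula above.
mu_total, var_total = 88.0, 4.0      # league mean, between-batter variance
mu_batter, var_batter = 95.0, 25.0   # noisy single-batter estimate

mu_new = (mu_total / var_total + mu_batter / var_batter) / (
    1 / var_total + 1 / var_batter
)
print(mu_new)  # pulled most of the way back toward the league mean
```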
The second way was to use an Empirical Bayes approach from David Robinson's book (the blog posts that became the book; the method is actually in Ch. 10 of the book and not in the posts). Basically, I modeled the counts as a Dirichlet-Multinomial. After obtaining the \alpha values, I simply added them to each batter's counts to get an estimate of the total number of hits per player that is biased toward the league prior.
```python
import numpy as np
import pymc as pm
import arviz as az

# Get data as a numpy array
hit_types = ["ground_ball", "line_drive", "fly_ball"]
counts = data[hit_types].to_numpy()

# Reparameterization of the Dirichlet-Multinomial
with pm.Model() as model:
    frac = pm.Dirichlet("frac", a=np.ones(len(hit_types)))
    conc = pm.Lognormal("conc", mu=1, sigma=1)
    # Named "obs_counts" so the RV doesn't shadow the data array
    obs_counts = pm.DirichletMultinomial(
        "obs_counts", a=frac * conc, n=counts.sum(1),
        shape=counts.shape, observed=counts,
    )
    trace = pm.sample(2000, chains=4, return_inferencedata=True)

az.plot_trace(data=trace, var_names=["frac", "conc"])

post = trace.posterior
alpha = (post["frac"].mean(("chain", "draw"))
         * post["conc"].mean(("chain", "draw"))).to_numpy()
```
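As a sketch of the pseudocount step (the `alpha` values and counts below are made-up stand-ins for the posterior means and my real data):

```python
import numpy as np

# Hypothetical posterior-mean Dirichlet parameters from the model above
alpha = np.array([4.0, 2.0, 3.0])  # ground_ball, line_drive, fly_ball

# Observed per-batter counts (rows: batters, cols: hit types)
counts = np.array([[13, 2, 2],
                   [5, 0, 3]])

# Adding alpha as pseudocounts shrinks each batter's hit-type proportions
# toward the league-wide prior; batters with few hits shrink the most.
shrunk = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
print(shrunk)
```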
It is an interesting approach and does kind of work, but it definitely has a huge shrinkage effect overall. The reality is that the mean exit velocity across all players is not that widely distributed, although there are a few notable exceptions (e.g. Stanton). This is where I feel a full Bayesian approach will shine: I can estimate players with low and high numbers of observations while capturing the uncertainty in each player's estimate. What I can't figure out is how to model the priors.
I was thinking something like this:
```python
with pm.Model() as model:
    # Priors:
    # SEE BELOW

    # Mean and std of the likelihood
    sigma_L = pm.HalfNormal("sigma_L", sigma=10)
    mu_L = pm.Deterministic("mu_L", b0 + b1 * h1 + b2 * h2 + b3 * h3)

    # Likelihood
    obs = pm.Normal("obs", mu=mu_L, sigma=sigma_L,
                    observed=data["exit_velocity"].to_numpy())
```
I was thinking I could model the priors as count distributions. The variance is much greater than the mean and there are lots of zeros, so probably a zero-inflated negative binomial:

```python
pm.ZeroInflatedNegativeBinomial(name, psi, mu=None, alpha=None, p=None, n=None)
```

If I consider only players with a sufficient number of observations (say > 100), the zeros go away and it's just a negative binomial (NB):

```python
pm.NegativeBinomial(name, mu=None, alpha=None, p=None, n=None)
```

For this, I need an expected proportion (psi; only for the zero-inflated case), a Poisson parameter (mu), and an alpha for each hit type. It seems like I need to model them jointly, so do I need to do something like the Dirichlet-Multinomial for the expected proportions? I am also having a hard time figuring out whether I should model the data (h1, h2, h3) or the weights (b0..b3) as NB.
I would like to add a hierarchical structure over players so that, given the number of counts per hit type, I can estimate an average exit velocity for players with either many or few observations.
Like I said, I am pretty new to the Bayesian framework so I appreciate your help with this issue.