I have been trying to take better advantage of using coords for specifying dimensions in my models. I am really enjoying using xarray
but am still a little confused on how best to cleanly specify multiple regression models with coords. I am currently debugging my model using some baseball data. For a straight-forward problem, I am simply trying to model the velocity of a hit ball using data from the thrown pitch. I want to introduce a hierarchical structure in a way that allows me to be flexible about the number of predictors I use in the model. The data frame (called sd
) has ~230k rows (individual pitches) across 906 individual pitchers with varying number of pitches per pitcher.
My current approach is to include an intercept term by using sm.add_constant
and then creating a set of weights, betas
, which includes this intercept. I could easily do that separately, but I donât think that is the source of my problem. I then simply multiply the weights (indexed by player) and the data using pm.math.dot
.
# Can/will change
predictors = ['release_speed', 'pfx_z', 'pfx_x','release_spin_rate']
sd = sd.dropna(subset=predictors).reset_index(drop=True)
# Factorize pitcher ID
pitcher_factorized, pitcher_list = sd.pitcher.factorize()
sd["pitcher_factorized"] = pitcher_factorized
# Get data and add column of 1's
data = sm.add_constant(sd[predictors]).rename(columns={"const":"intercept"}).astype("float")
# Set up coords
coords = {
"player" : pitcher_list,
"obs_id" : data.index.tolist(),
"predictors" : data.columns.tolist()
}
with pm.Model(coords=coords) as model:
# Define data - not pretty
X = pm.MutableData("X", data)
y = pm.MutableData("y", sd.launch_speed.to_numpy().astype("float"))
player_idx = pm.MutableData("player_idx", sd.pitcher_factorized.tolist())
# Priors on betas
mu = pm.Normal("mu", mu=0.0, sigma=10)
sigma = pm.HalfNormal("sigma", 1)
# Define priors on each parameter
beta = pm.Normal("beta",mu=mu,sigma=sigma,dims=("predictors","player"))
# Model error
err = pm.Exponential("err",1)
# Expected value
y_hat = pm.math.dot(beta[player_idx],X)
# # Likelihood
y_likelihood = pm.Normal("y_likelihood",mu=y_hat, sigma=err, observed=y)
I am getting shape issues and I am have not been able to fix it. Should my priors, mu
and sigma
, also be multidimensional or is it inferred that each beta
is centered at mu
, for example? Am I setting up the coords
correctly? Iâve seen this thread, I am still a bit confused after trying to debug my own problem.
My second question is about best practices. Is there a better way to set this code up that allows for generalization but also adheres to some set of practices? I have had success with hierarchical models in the past where I defined each variable explicitly, but it gets messy and cumbersome.
Thanks for the help!