Problem with coords/dims in hierarchical model

I have been trying to take better advantage of coords for specifying dimensions in my models. I am really enjoying using xarray but am still a little confused about how best to cleanly specify multiple regression models with coords. I am currently debugging my model using some baseball data. For a straightforward problem, I am simply trying to model the velocity of a hit ball using data from the thrown pitch. I want to introduce a hierarchical structure in a way that allows me to be flexible about the number of predictors I use in the model. The data frame (called sd) has ~230k rows (individual pitches) across 906 individual pitchers, with a varying number of pitches per pitcher.

My current approach is to include an intercept term by using sm.add_constant and then creating a set of weights, betas, which includes this intercept. I could easily do that separately, but I don’t think that is the source of my problem. I then simply multiply the weights (indexed by player) and the data using pm.math.dot.

# Can/will change
predictors = ['release_speed', 'pfx_z', 'pfx_x','release_spin_rate']
sd = sd.dropna(subset=predictors).reset_index(drop=True)

# Factorize pitcher ID
pitcher_factorized, pitcher_list = sd.pitcher.factorize()
sd["pitcher_factorized"] = pitcher_factorized

# Get data and add column of 1's
data = sm.add_constant(sd[predictors]).rename(columns={"const":"intercept"}).astype("float")

# Set up coords
coords = {
          "player"     : pitcher_list,
          "obs_id"     : data.index.tolist(), 
          "predictors" : data.columns.tolist()
          }

with pm.Model(coords=coords) as model:

    # Define data - not pretty
    X = pm.MutableData("X", data)
    y = pm.MutableData("y", sd.launch_speed.to_numpy().astype("float"))
    player_idx = pm.MutableData("player_idx", sd.pitcher_factorized.tolist())

    # Priors on betas
    mu    = pm.Normal("mu", mu=0.0, sigma=10)
    sigma = pm.HalfNormal("sigma", 1)

    # Define priors on each parameter
    beta = pm.Normal("beta",mu=mu,sigma=sigma,dims=("predictors","player"))
    
    # Model error
    err = pm.Exponential("err",1)

    # Expected value
    y_hat = pm.math.dot(beta[player_idx],X)

    # Likelihood
    y_likelihood = pm.Normal("y_likelihood",mu=y_hat, sigma=err, observed=y)

I am getting shape errors and have not been able to fix them. Should my priors, mu and sigma, also be multidimensional, or is it inferred that each beta is centered at mu, for example? Am I setting up the coords correctly? I’ve seen this thread, but I am still a bit confused after trying to debug my own problem.

My second question is about best practices. Is there a better way to set this code up that allows for generalization but also adheres to some set of practices? I have had success with hierarchical models in the past where I defined each variable explicitly, but it gets messy and cumbersome.

Thanks for the help!

The first thing I see right away is that you’re indexing beta on the first axis with player_idx, but the dims of beta are predictors, player. I think it should be beta[:, player_idx].

Thinking about shapes a bit more, if you have k predictors, p players, and n rows of data, the output of pm.math.dot(beta[:, player_idx], X) will be (k, n), (n, k) -> (k, k), which isn’t right. I think, since you are expanding beta to be shape (k, n) with the player_idx, you can do the dot product “by hand” like this:

y_hat = (beta[:, player_idx].T * X).sum(axis=-1)

This will multiply each cell of each row by the correct beta for that variable, then sum within each row, producing the usual linear projection. The output shape should now be (n,).
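To sanity-check these shapes outside of PyMC, here is a small NumPy sketch with toy sizes and random data (not the real baseball frame), comparing the vectorized expression against a row-by-row dot product:

```python
import numpy as np

# Toy sizes: k predictors, p players, n pitches
k, p, n = 3, 4, 6
rng = np.random.default_rng(0)
beta = rng.normal(size=(k, p))           # one coefficient vector per player
X = rng.normal(size=(n, k))              # design matrix
player_idx = rng.integers(0, p, size=n)  # which player threw each pitch

# Expand beta to (k, n), transpose to (n, k), multiply elementwise, sum rows
y_hat = (beta[:, player_idx].T * X).sum(axis=-1)

# Same quantity computed one row at a time
y_check = np.array([X[i] @ beta[:, player_idx[i]] for i in range(n)])

assert y_hat.shape == (n,)
assert np.allclose(y_hat, y_check)
```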

One other little “optimization” I see is that you could use the .assign method of DataFrames instead of sm.add_constant to save a few keystrokes, as in data = sd[predictors].assign(intercept = 1.0), which should give you the name and dtype you want on the constant.

For your priors, you currently have a “fully pooled” model, where all player-pitch tuples are assumed to come from the same underlying distribution. (edit: this is wrong, see below) You are also pooling all your parameters, so you assume that the value of the coefficient on release_speed comes from the same distribution as the coefficient on pfx_z, for example. You could split up the parameters by adding a predictors dimension to mu and sigma. In this case, you would still have full pooling of player-pitch tuples, but each feature would be treated independently.

As far as best practices go, I think what you have looks really clean. If you end up using non-centered parameterizations for beta that can be messy, so I sometimes write helper functions to handle that. You could also consider writing a function that handles model creation and returns the model object. That would make building and comparing a bunch of specifications cleaner.


Thanks so much for your reply! Your answer is very helpful.

as in data = sd[predictors].assign(intercept = 1.0) , which should give the name and dtype you want on the constant.

Good suggestion!

The first thing I see right away is that you’re indexing beta on the first axis with player_idx, but the dims of beta are predictors, player. I think it should be beta[:, player_idx].

Thinking about shapes a bit more, if you have k predictors, p players, and n rows of data, the output of pm.math.dot(beta[:, player_idx], X) will be (k, n), (n, k) -> (k, k), which isn’t right.

I guess this is where I am still getting a little confused with the indexing. Just to write it out (mainly for me to say it out loud): I am trying to use a linear model that weights the k predictors with weights \beta_0 (the intercept), \beta_1, \ldots. I want to assume that each player has their own weights, but those weights are drawn from some distribution (with an individual distribution for each feature). I can see where I am fully pooling from your comment:

You are also pooling all your parameters, so you assume that the value of the coefficient on release_speed comes from the same distribution as the coefficient on pfx_z , for example. You could split up the parameters by adding a predictors dimension to mu and sigma .

I could sense I had this wrong and this helps. What is not clear to me is the dimensions of beta.

beta = pm.Normal("beta",mu=mu,sigma=sigma,dims=("predictors","player"))

Am I correct in trying to specify that I should have \beta 's for each feature and for each player, making it dimension (k, p)? Does it matter in the PyMC/Aesara environment if it’s (k, p) or (p, k), or is that just a linear algebra problem? This is where I get confused, since my assumption is that the matrix beta is (k, p), but we index it using the player_idx variable, which has length n.

Lastly, going back to the comment about pooling, you mention:

You could split up the parameters by adding a predictors dimension to mu and sigma . In this case, you would still have full pooling of player-pitch tuples, but each feature would be treated independently.

What do you mean by player-pitch tuples? The parameters are fit for all observations (pitches) of a single pitcher, right?

Thanks!

By “player-pitch tuple” I mean an observation in your dataset. I guess I could just say “pitch”, but I wanted to emphasize that there is a pitcher dimension on each row. I was thinking of a row as a pair of two IDs, (player_id, pitch_id), so (31, 0) would be the first pitch in the dataset thrown by player 31. (I don’t know if the ordering of pitches matters in your current model, but this framing lets you think about pitch ordering if you wanted to extend the model to account for time-in-game, where each row would become a 3-tuple of (player_id, game_id, pitch_number).)

Your betas are correct, and it currently doesn’t matter which order you write the dims in; you are right to note that the shape of the beta matrix is just a matter of linear algebra convention. More on this in a second.

You should look at numpy fancy indexing to get a better sense of what player_idx is doing. If you pass an array of indexes to a numpy array (and PyTensor/Aesara follow numpy conventions wherever possible), it will, for each index in the indexing array, pick out the value at that index. A concrete example:

arr = np.array([3, 2, 5])
arr[np.array([0, 0, 0, 0])] 
>>> [3, 3, 3, 3]

Here the indexing array is (0, 0, 0, 0), so it asks for the item in the 0th index 4 times and gives you back an array. See the link for more details/examples.

When you index beta by player_idx, you ask, for each entry of player_idx (i.e., for each pitch), for the parameter vector at that position along beta’s 2nd dimension. That corresponds to the coefficient vector of the player who threw that pitch.
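The same fancy indexing applied to one axis of a 2-D array, which is exactly what beta[:, player_idx] does (toy numbers, not real coefficients):

```python
import numpy as np

# Rows are predictors, columns are players
beta = np.array([[1.0, 2.0, 3.0],      # predictor 0, players 0..2
                 [10.0, 20.0, 30.0]])  # predictor 1, players 0..2
player_idx = np.array([2, 0, 2, 1])    # the player for each of 4 pitches

# Column j of the result is beta[:, player_idx[j]]
expanded = beta[:, player_idx]

assert expanded.shape == (2, 4)
assert np.array_equal(expanded, np.array([[3.0, 1.0, 3.0, 2.0],
                                          [30.0, 10.0, 30.0, 20.0]]))
```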

So the problem you have is that mu and sigma are scalars, so the entire matrix beta is draws from one single N(\mu, \sigma) distribution. Under the hood, PyMC is just drawing k * p numbers and then reshaping them. If you want a different distribution for each predictor, you need to pass a vector-valued mu and sigma, and PyMC is smart enough to give you back what you want. In this case, the order of the dims matters – you need to give the batch dim first.
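The broadcasting rule behind “batch dim first” can be seen with plain NumPy (toy sizes, mimicking the PyMC behavior rather than calling it):

```python
import numpy as np

rng = np.random.default_rng(0)
k, p = 4, 906
mu = np.arange(k).astype(float)  # one mean per predictor
sigma = np.arange(k) + 1.0       # one sd per predictor

# Batch dim (player) first: shape (p, k) broadcasts against the (k,)
# hyperpriors, so column j of the draws is centered at mu[j]
draws = rng.normal(loc=mu, scale=sigma, size=(p, k))
assert np.allclose(draws.mean(axis=0), mu, atol=0.5)

# The other order, shape (k, p), does not broadcast and raises instead
try:
    rng.normal(loc=mu, scale=sigma, size=(k, p))
except ValueError:
    pass
```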

I made a little gist showing how this all works here.

For working through hierarchical pooling, I strongly recommend the radon case study notebook. I link this for my benefit as much as yours, because I misspoke when I said you are fully pooling across players. You have a parameter for each player linked by a single hyperprior, so you’re doing partial pooling. You’re just also pooling between parameter values, because all the parameters are linked together by the shared priors mu and sigma. Sorry for that error.


Ok this all makes so much sense now. Thanks for such thorough responses to my questions!

By “player-pitch tuple” I mean an observation in your dataset. I guess I could just say “pitch”, but I wanted to emphasize that there is a pitcher dimension on each row…

:+1: No problem, I totally get it now. I just wanted to clarify that I was understanding correctly. It does help to emphasize the relationship between increasing the levels in the model and conditioning in the data (i.e., I observed pitches but want to model them based on pitcher, batter, home/away, number of beers consumed by the fans haha). If that makes sense.

arr = np.array([3, 2, 5])
arr[np.array([0, 0, 0, 0])] 
>>> [3, 3, 3, 3]

Here the indexing array is (0, 0, 0, 0), so it asks for the item in the 0th index 4 times and gives you back an array. See the link for more details/examples.

↑ This makes sense now. I suspected it was something like that, but I am very clear on the indexing now. It also clears up why coords has the unique list of pitchers while player_idx is the list of players in the original data frame. It helps connect the relationship between xarray and everything under the hood.

So the problem you have is that mu and sigma are scalars, so the entire matrix beta is draws from one single N(μ,σ) distribution. Under the hood, PyMC is just drawing k * p numbers and then reshaping them. If you want a different distribution for each predictor, you need to pass a vector-valued mu and sigma, and PyMC is smart enough to give you back what you want. In this case, the order of the dims matters – you need to give the batch dim first.

I made a little gist showing how this all works here.

Got it! You are correct that I did want to assume separate distributions for each parameter. The gist is really helpful.

with pm.Model(coords=coords) as shape_demo:
    beta_mu = pm.Normal('mu', mu=np.arange(k), dims=['predictors'])
    beta_sigma = pm.HalfNormal('sigma', sigma=np.arange(k) + 1, dims=['predictors'])
    
    # player is the "batch dimension", and MUST come first!
    beta = pm.Normal('beta', mu=beta_mu, sigma=beta_sigma, dims=['player', 'predictors'])

Adding the offsets for mu and sigma ↑ really helps to clarify the message when viewing the numpy draws. Subtle but effective teaching tool!

For working through hierarchical pooling, I strongly recommend the radon case study notebook. I link this for my benefit as much as yours, because I misspoke when I said you are fully pooling across players. You have a parameter for each player linked by a single hyperprior, so you’re doing partial pooling. You’re just also pooling between parameter values, because all the parameters are linked together by the shared priors mu and sigma. Sorry for that error.

No problem, I can see clearly now which parameters are pooled and which are partially pooled. Thanks for pointing me back to that notebook. It has been a while since I read through it, and it looks like they have added some really great additions that I had not seen before!

Thanks again for the great help!
