Is my hierarchal model set up correctly?

player_idxs, players = pd.factorize(stats.PLAYER)
coords = {
    "player": players, # 464 unique players
    "obs_id": np.arange(len(player_idxs)), # 6899 rows
}

stats[['PLAYER', 'EXPECTED', 'OBSERVED', 'RESIDUAL']].head()

From this I would like to be able to get the posterior distribution of residuals for a given player.

with pm.Model(coords=coords) as partially_pooled_model:

    # Independent parameters for each player
    player_idx = pm.Data("player_idx", player_idxs, dims="obs_id")
    
    # Hyperpriors for group nodes
    mu_a = pm.Normal("mu_a", mu=0.0, sigma=7, dims="player")
    sigma_a = pm.HalfNormal("sigma_a", 5.0, dims="player")
    
    # Data likelihood
    y_pred = pm.Normal("y_pred", mu_a[player_idx], sigma_a[player_idx], observed=stats.RESIDUAL, shape=stats['Name'].shape)
    
    posterior = pm.sample()
    posterior_pred = pm.sample_posterior_predictive(posterior)

The shape of posterior_pred['y_pred'] is (4000, 6899), to get the posterior distribution for a player would I have to iterate through the 4000 arrays and get all the values at each index that correspond to a player? For example the indices for the player “Joel Embiid” in the stats DataFrame are

[0,  179,  674, 1087, 1399, 1520, 1842, 2280, 2442, 2777, 3090, 3461, 3676, 4082, 4394, 4694, 4988, 5866, 6124, 6397, 6628]
1 Like

Yes, I think that’s the case. You’ll have to index somewhere to extract the residuals. If there were equal numbers of observations per player, there would likely be a cleaner solution but that does not appear to be true here.

2 Likes

Depending on exactly what you mean by “the posterior distribution of residuals”, you may be able to get away with just grabbing samples from posterior. The posterior predictive sampling you are doing just plugs in each player’s ~ mu_a and sigma_a into a normal distribution and takes N draws (where N is the number of observations for that player). If this is what you need, then great. But if you just want that normal distribution each player, you can just iterate over players, grab their mean/SD, use them to parameterize a normal distribution and stop.

1 Like

Thanks for the response!

Apologies as I am new to this, I may have formulated the model incorrectly, but my goal is to get an estimated distribution of residuals per player, based on that players individual residuals, along with the overall distribution of all players residuals.

I thought the samples from pm.sample_posterior_predictive stored in posterior_pred would give me sampled residuals and if I grab all samples for an individual player, that is how I could get the estimated residual distribution for a player, is my understanding incorrect?

1 Like

Your understanding sounds correct, it just may not be the most efficient or direct way to get what you need. Imagine a simplified scenario in which you have a single player with 5 observations. You model the observations as being normally distributed with some mean (mu) and SD (sigma). You sample

trace = pm.sample(1000, cores=2)

Now you have a 2000 credible mu-sigma pairs. You can now:

1 - Generate synthetic data sets, by calling sample_posterior_predictive(). Each of these data sets will consist of 5 credible observations (just like your original data).

2 - Take each of the 2000 mu-sigma pairs sitting in trace and use it to generate a new normal distribution (so 2000 in total). Each of these normal distribution captures a credible “origin” for your 5 observations.

Which of these is more useful for your purposes is up to you.

2 Likes