Alright folks, I have a question that I haven’t been able to find an answer to. I want to create a model with fixed effects for a Poisson distribution, but the number of effects per observation can vary. For example, in the attached data I have a list of observations of total points, and in each observation there are players, indicated by integers, who contribute to those points. But the total number of players in each observation varies. Is this possible?
model_data.csv (160.8 KB)
Can you post a data file with the points by PlayerID for each game, such that they sum to the total points? A list column called points_by_player_id would work. I have an idea of how to approach this problem.
Also, I am looking at the Game_ID column. Why is there > 1 game per combination of Player_IDs? What kind of game is this? Is this cricket with 2 innings, one for each team? If it is, it would probably be better to have a GameID and an Inning_Number of 1 or 2. Judging by the scores this looks like T20 format.
Imagine your data looks something like this, in long format with one row per batsman per inning (made-up values):
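```python
import pandas as pd

# made-up example of the long format assumed below:
# one row per batsman per inning, players indicated by integers
df_batting_innings = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2, 2],
    'inning':  [1, 1, 1, 2, 2, 2],
    'player':  [3, 7, 11, 3, 5, 7],
    'runs':    [45, 12, 60, 8, 33, 21],
})
```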
You could use a model that looks like the one below. I assumed it was T20 cricket. I put strong priors on the mean runs, and I ensured that a batsman could not score more than 150 runs in an inning by using a truncated Negative Binomial.
```python
import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt

# Set up the model with coordinates
coords = {
    'obs_id': np.arange(df_batting_innings.shape[0]),
    'player': pd.Categorical(df_batting_innings['player']).categories,
}

with pm.Model(coords=coords) as model:
    # Define the logp function for the truncated Negative Binomial
    def truncated_neg_binom_logp(value, mu, alpha, max_trunc):
        # Use the Negative Binomial distribution
        neg_binom = pm.NegativeBinomial.dist(mu=mu, alpha=alpha)
        logp = pm.logp(neg_binom, value)
        # Apply truncation: zero probability mass above max_trunc
        # (note: the remaining mass is not renormalized)
        return pt.switch(pt.gt(value, max_trunc), -np.inf, logp)

    # Define the random function for the truncated Negative Binomial
    def truncated_neg_binom_random(mu, alpha, max_trunc, rng=None, size=None):
        # Generate samples from the Negative Binomial parameterized by (mu, alpha)
        samples = rng.negative_binomial(n=alpha, p=alpha / (alpha + mu), size=size)
        # Apply truncation by clipping (approximates truncation for predictive draws)
        return np.clip(samples, None, max_trunc)

    # Data containers for runs and player indices
    runs = pm.Data('runs', df_batting_innings['runs'], dims='obs_id')
    player_idx = pm.Data('player_idx', pd.Categorical(df_batting_innings['player']).codes, dims='obs_id')

    # Global intercept and player intercepts
    global_intercept = pm.Normal('global_intercept', mu=np.log(20), sigma=np.log(5))
    player_intercept = pm.Normal('player_intercept', mu=0, sigma=1, dims='player')
    alpha = pm.Exponential('alpha', 1.0)

    # Mean parameter for the Negative Binomial
    mu = pm.math.exp(global_intercept + player_intercept[player_idx])

    # Mean prediction for each player (player-specific mu)
    player_mu = pm.Deterministic('player_mu', pm.math.exp(global_intercept + player_intercept), dims='player')

    # Truncation value: a batsman cannot score more than 150 in an inning
    max_trunc = 150

    # Custom truncated Negative Binomial for the observed data
    observed_var = pm.CustomDist(
        'observed_var',
        mu, alpha, max_trunc,  # parameters passed positionally
        random=truncated_neg_binom_random,
        logp=truncated_neg_binom_logp,
        observed=runs,
        dtype='int',
    )

    idata = pm.sample(nuts_sampler='nutpie')
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=False)

idata
```
It samples with no divergences.
And the posterior predictive looks OK:
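You can produce that check with something along these lines (assuming ArviZ is installed):

```python
import arviz as az

# overlay posterior predictive draws on the observed runs
az.plot_ppc(idata, num_pp_samples=100)
```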
To estimate runs in an inning, you can put random combinations of batsmen together, or a specific combo, by doing something like this:
```python
unique_players_list = df_batting_innings['player'].unique()

# pick a random set of 7 batsmen (or supply a specific combination)
batters = np.random.choice(unique_players_list, 7, replace=False)

# round each player's posterior mean score, then sum across the chosen batsmen
idata_mean_player_scores = np.rint(idata.posterior.player_mu)
idata_innings = idata_mean_player_scores.sel({'player': batters}).sum(dim='player')
idata_innings
```
This estimates the mean value of their combined runs in an inning. The posteriors look like this:
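To visualize them, something like this should work:

```python
import arviz as az

# flatten chains/draws and plot the posterior of the combined innings total
az.plot_posterior(idata_innings.values.flatten())
```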
I hope that this is helpful, as it resolves your concern about the differing number of batsmen in an inning.
Thanks for the suggestion. My data was NBA, although I’ve thought some about cricket as well, so this was helpful. What I was thinking is that in the NBA each player’s strength possibly depends on the other players in the game, so if you try to infer a player’s strength from their own points scored, you might end up over- or underestimating it. So instead of sampling on each player’s points scored, I was thinking you could sample on the total team points instead. In cricket, by contrast, each player is batting effectively independently.
Here’s a semi-relevant example where they infer player strength based on its impact on the team’s likelihood of winning:
In this case they break the game up into sections where the 10 players on the court are known, so the total number of players is always 10. I could do the same but use points scored instead, though it would require a lot of data engineering.
Yeah, that is way more complicated than T20 cricket! What got me was the multiple shifts in a game, which I took to mean innings; my cricket prior kicked in. @AustinRochford wrote some blog posts a couple of years back on IRT and NBA foul calls, which might be useful when thinking about your problem:
This might be useful too:
https://sethah.github.io/ncaa-ratings.html
The canonical PyMC sports article is Peadar Coyle’s, which concerns rugby:
Oh yeah, definitely read these several times.
What about creating a “dummy player” and filling rosters up to the max roster size with the dummy player? For example, say the most players any team uses in a game is 12, but another team only plays 7 guys in a game. We fill the remaining 5 spots with the dummy player, and then we make the dummy player’s effect always 0.
Is that possible to do?
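Roughly, I’m imagining something like this (made-up data, with index 0 reserved for the dummy player):

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

# made-up example: two games with rosters of different sizes,
# padded to a fixed max size with the reserved dummy index 0
MAX_ROSTER = 12
DUMMY = 0
rosters = [
    [3, 7, 9, 2, 5, 11, 8],                   # only 7 players used
    [1, 4, 6, 10, 12, 2, 9, 3, 5, 7, 8, 11],  # all 12 used
]
padded = np.array([r + [DUMMY] * (MAX_ROSTER - len(r)) for r in rosters])
team_points = np.array([98, 110])
n_players = 13  # dummy + 12 real players

with pm.Model() as dummy_model:
    intercept = pm.Normal('intercept', mu=np.log(100), sigma=0.5)
    strength_real = pm.Normal('strength_real', mu=0, sigma=0.1, shape=n_players - 1)
    # prepend a hard zero so the dummy player's effect is always 0
    strength = pt.concatenate([pt.zeros(1), strength_real])
    log_mu = intercept + strength[padded].sum(axis=1)
    pm.Poisson('team_points', mu=pt.exp(log_mu), observed=team_points)
```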
Let’s say we treat points as multinomial on each simplex of player shifts, and each shift on each team has to have 5 players. Then I don’t think we need to add a dummy player; we just have to keep the player index tracked so we can monitor each player’s parameter. Then we can take those posteriors and play with them. I think that makes sense, but I would need to play with the data and model. Happy to work on it with you, as it is an interesting problem, and I use the multinomial likelihood a fair deal and want to get more skilled at it.
PS: the other thing is how we make the values interdependent.
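Off the top of my head, the shift-level multinomial might look something like this (fake data; the softmax over on-court strengths is one way to get that interdependence, since each player’s share depends on who else is on the floor):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)

# fake data: 200 shifts, 5 players on court per shift for one team,
# and the points scored by each of those 5 players during the shift
n_players, n_shifts = 50, 200
lineups = np.stack([rng.choice(n_players, size=5, replace=False) for _ in range(n_shifts)])
points = rng.multinomial(20, np.full(5, 0.2), size=n_shifts)

with pm.Model() as shift_model:
    strength = pm.Normal('strength', mu=0, sigma=1, shape=n_players)
    # each player's share of a shift's points depends on who else is
    # on the court, which makes the player effects interdependent
    p = pm.math.softmax(strength[lineups], axis=-1)  # shape (n_shifts, 5)
    pm.Multinomial('points', n=points.sum(axis=1), p=p, observed=points)
```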