Alright folks, I have a question that I haven’t been able to find an answer to. I want to create a model with fixed effects for a Poisson distribution, but the number of effects per observation can vary. For example, in the attached data I have a list of observations of total points, and in each observation there are players, indicated by integers, who contribute to those points. But the total number of players in each observation varies. Is this possible?
model_data.csv (160.8 KB)
Can you post a data file with the points by PlayerID for each game, such that they sum to the total points? A list column called points_by_player_id would work. I have an idea of how to approach this problem.
Also, I am looking at the Game_ID column. Why is there > 1 game per combination of Player_IDs? What kind of game is this? Is this cricket with 2 innings, one for each team? If it is, it would probably be better to have a GameID and an Inning_Number of 1 or 2. Judging by the scores this looks like T20 format.
Imagine your data looks something like this, in long format with one row per batsman per inning (made-up values):
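```python
import pandas as pd

# made-up example of the long format assumed below:
# one row per batsman per inning, players indicated by integers
df_batting_innings = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2, 2],
    'inning':  [1, 1, 1, 2, 2, 2],
    'player':  [3, 7, 11, 3, 5, 7],
    'runs':    [45, 12, 60, 8, 33, 21],
})
```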
You could use a model that looks like the one below. I assumed it was T20 cricket. I put strong priors on the mean runs, and I ensured that a batsman could not score more than 150 runs in an inning by using a truncated Negative Binomial.
```python
import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt

# Set up the model with coordinates
coords = {
    'obs_id': np.arange(df_batting_innings.shape[0]),
    'player': pd.Categorical(df_batting_innings['player']).categories,
}

with pm.Model(coords=coords) as model:
    # Define the logp function for the truncated Negative Binomial
    def truncated_neg_binom_logp(value, mu, alpha, max_trunc):
        # Use the Negative Binomial distribution
        neg_binom = pm.NegativeBinomial.dist(mu=mu, alpha=alpha)
        logp = pm.logp(neg_binom, value)
        # Apply truncation: zero probability mass above max_trunc
        # (note: the remaining mass is not renormalized)
        return pt.switch(pt.gt(value, max_trunc), -np.inf, logp)

    # Define the random function for the truncated Negative Binomial
    def truncated_neg_binom_random(mu, alpha, max_trunc, rng=None, size=None):
        # Generate samples from the Negative Binomial parameterized by (mu, alpha)
        samples = rng.negative_binomial(n=alpha, p=alpha / (alpha + mu), size=size)
        # Apply truncation by clipping (approximates truncation for predictive draws)
        return np.clip(samples, None, max_trunc)

    # Data containers for runs and player indices
    runs = pm.Data('runs', df_batting_innings['runs'], dims='obs_id')
    player_idx = pm.Data('player_idx', pd.Categorical(df_batting_innings['player']).codes, dims='obs_id')

    # Global intercept and player intercepts
    global_intercept = pm.Normal('global_intercept', mu=np.log(20), sigma=np.log(5))
    player_intercept = pm.Normal('player_intercept', mu=0, sigma=1, dims='player')
    alpha = pm.Exponential('alpha', 1.0)

    # Mean parameter for the Negative Binomial
    mu = pm.math.exp(global_intercept + player_intercept[player_idx])

    # Mean prediction for each player (player-specific mu)
    player_mu = pm.Deterministic('player_mu', pm.math.exp(global_intercept + player_intercept), dims='player')

    # Truncation value: a batsman cannot score more than 150 in an inning
    max_trunc = 150

    # Custom truncated Negative Binomial for the observed data
    observed_var = pm.CustomDist(
        'observed_var',
        mu, alpha, max_trunc,  # parameters passed positionally
        random=truncated_neg_binom_random,
        logp=truncated_neg_binom_logp,
        observed=runs,
        dtype='int',
    )

    idata = pm.sample(nuts_sampler='nutpie')
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=False)

idata
```
It samples with no divergences.
And the posterior predictive looks OK:
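You can produce that check with something along these lines (assuming ArviZ is installed):

```python
import arviz as az

# overlay posterior predictive draws on the observed runs
az.plot_ppc(idata, num_pp_samples=100)
```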
To estimate runs in an inning, you can put random combinations of batsmen together, or a specific combo, by doing something like this:
```python
unique_players_list = df_batting_innings['player'].unique()

# pick a random set of 7 batsmen (or supply a specific combination)
batters = np.random.choice(unique_players_list, 7, replace=False)

# round each player's posterior mean score, then sum across the chosen batsmen
idata_mean_player_scores = np.rint(idata.posterior.player_mu)
idata_innings = idata_mean_player_scores.sel({'player': batters}).sum(dim='player')
idata_innings
```
This estimates the mean value of their combined runs in an inning. The posteriors look like this:
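To visualize them, something like this should work:

```python
import arviz as az

# flatten chains/draws and plot the posterior of the combined innings total
az.plot_posterior(idata_innings.values.flatten())
```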
I hope that this is helpful, as it resolves your concern about the differing number of batsmen in an inning.
Thanks for the suggestion. My data was NBA, although I’ve thought some about cricket as well, so this was helpful. What I was thinking is that in the NBA each player’s strength possibly depends on the other players in the game, so if you try to infer a player’s strength from their own points scored, you might end up over- or underestimating it. So instead of sampling on each player’s points scored, I was thinking you could sample on the total team points instead. In cricket, by contrast, each player is batting effectively independently.
Here’s a semi-relevant example where they infer player strength based on its impact on the team’s likelihood of winning:
In this case they break the game up into sections where the 10 players on the court are known, so the total number of players is always 10. I could do the same but use points scored instead, though it would require a lot of data engineering.
Yeah, that is way more complicated than T20 cricket! What got me was the multiple shifts in a game, which I took to mean innings; my cricket prior kicked in. @AustinRochford wrote some blog posts a couple of years back on IRT and NBA foul calls, which might be useful when thinking about your problem:
This might be useful too:
https://sethah.github.io/ncaa-ratings.html
The canonical PyMC sports article is Peadar Coyle’s, which concerns rugby:
Oh yeah, definitely read these several times.
What about creating a “dummy player” and filling rosters up to the max roster size with the dummy player? For example, say the most players any team uses in a game is 12, but another team only plays 7 guys in a game. We fill the remaining 5 spots with the dummy player, and then we make the dummy player’s effect always 0.
Is that possible to do?
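Roughly, I’m imagining something like this (made-up data, with index 0 reserved for the dummy player):

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

# made-up example: two games with rosters of different sizes,
# padded to a fixed max size with the reserved dummy index 0
MAX_ROSTER = 12
DUMMY = 0
rosters = [
    [3, 7, 9, 2, 5, 11, 8],                   # only 7 players used
    [1, 4, 6, 10, 12, 2, 9, 3, 5, 7, 8, 11],  # all 12 used
]
padded = np.array([r + [DUMMY] * (MAX_ROSTER - len(r)) for r in rosters])
team_points = np.array([98, 110])
n_players = 13  # dummy + 12 real players

with pm.Model() as dummy_model:
    intercept = pm.Normal('intercept', mu=np.log(100), sigma=0.5)
    strength_real = pm.Normal('strength_real', mu=0, sigma=0.1, shape=n_players - 1)
    # prepend a hard zero so the dummy player's effect is always 0
    strength = pt.concatenate([pt.zeros(1), strength_real])
    log_mu = intercept + strength[padded].sum(axis=1)
    pm.Poisson('team_points', mu=pt.exp(log_mu), observed=team_points)
```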
Let’s say we treat points as multinomial on each simplex of player shifts, and each shift on each team has to have 5 players. Then I don’t think we need to add a dummy player; we just have to keep the player index tracked so we can monitor each player’s parameter. Then we can take those posteriors and play with them. I think that makes sense, but I would need to play with the data and model. Happy to work on it with you, as it is an interesting problem, and I use the multinomial likelihood a fair deal and want to get more skilled at it.
PS: the other thing is how we make the values interdependent.
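Off the top of my head, the shift-level multinomial might look something like this (fake data; the softmax over on-court strengths is one way to get that interdependence, since each player’s share depends on who else is on the floor):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)

# fake data: 200 shifts, 5 players on court per shift for one team,
# and the points scored by each of those 5 players during the shift
n_players, n_shifts = 50, 200
lineups = np.stack([rng.choice(n_players, size=5, replace=False) for _ in range(n_shifts)])
points = rng.multinomial(20, np.full(5, 0.2), size=n_shifts)

with pm.Model() as shift_model:
    strength = pm.Normal('strength', mu=0, sigma=1, shape=n_players)
    # each player's share of a shift's points depends on who else is
    # on the court, which makes the player effects interdependent
    p = pm.math.softmax(strength[lineups], axis=-1)  # shape (n_shifts, 5)
    pm.Multinomial('points', n=points.sum(axis=1), p=p, observed=points)
```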