Hierarchical Multinomial where groups change in each observation

Alright folks, I could use some help on a project. I want to predict the number of kills for League of Legends players in a match using a multinomial; effectively trying to divvy up the total team kills among the players. To start, I wanted to make a model with a hierarchical player effect. The challenge is that, while the number of players in a match is constant (5), the identity of those players changes from match to match. The closest example I could find of something like this was NUTS won't sample in Hierarchical multinomial · Issue #3166 · pymc-devs/pymc · GitHub. But when I try to implement a similar approach, i.e.

import numpy as np
import pymc as pm

blue_team_player_kills_cols = [f'Blue_player_{i}_kills' for i in range(1, 6)]
blue_team_player_idx_cols = [f'Blue_player_{i}_idx' for i in range(1, 6)]

blue_side_df = df_pivot[blue_team_player_idx_cols + blue_team_player_kills_cols]

with pm.Model(coords=player_coords) as hierarchical_model:

    player_idx = pm.MutableData("player_idx",
                                blue_side_df[blue_team_player_idx_cols].values,
                                dims=['match', 'player_per_match'])

    # Hyperpriors for group nodes
    mu_a = pm.Dirichlet('alpha', a=np.ones(len(blue_side_df)), dims='match')

    mu_a_inv = pm.math.softmax(mu_a)

    a = pm.MvNormal('a', mu=mu_a_inv, cov=np.eye(len(blue_side_df)),
                    dims=['match', 'player_per_match'])

    p = pm.math.softmax(a[player_idx])

    # Data likelihood
    y_obs = pm.Multinomial('y_obs',
                           n=blue_side_df[blue_team_player_kills_cols].sum(axis=1),
                           p=p,
                           observed=blue_side_df[blue_team_player_kills_cols])

I get a shape error when trying to run sample_prior_predictive()

ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (34770,3477) and requested shape (3477,3477)

I’ve read a lot of @AlexAndorra 's work on elections using hierarchical multinomials, but it looked like in that case the identity of the groups (i.e., the parties) is constant across observations, and whenever a party did not participate, its vote count was set to 0. In theory you could do that here, but it seems like a bad choice, because most players won't play in most games, so the model would assume a player's kill count is most often 0. And in future predictions we’d want to set most players' kills to 0 anyway (because they aren’t playing).

Any help is much appreciated. I’m super stuck here.
multinomial_test_data.csv (99.9 KB)

This may be helpful for the shape issues: Distribution Dimensionality — PyMC 5.19.1 documentation

If there is indeed a shape issue, a workaround might be to work with “long format” data, i.e. stack your columns into the index and then reset the index. If your data isn’t encoded already, you can also define player_idx as a categorical series, which is nice because you can pick up the codes to select the correct indices of the RVs you’ve defined. The categories would then be used as the player_coords.
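A minimal sketch of that categorical-encoding idea, using a toy long-format frame (the column names are illustrative, not from your data):

```python
import pandas as pd

# Toy long-format data: one row per (match, player).
long_df = pd.DataFrame({
    'match_id': [0, 0, 1, 1],
    'player_id': [41, 42, 41, 43],
    'kills': [3, 1, 2, 4],
})

# Encode players as a categorical: .cat.codes gives the integer indices for
# fancy-indexing into your RVs, .cat.categories gives the coord labels.
player_cat = long_df['player_id'].astype('category')
player_idx = player_cat.cat.codes.to_numpy()           # [0, 1, 0, 2]
player_coords = {'player': player_cat.cat.categories}  # [41, 42, 43]
```

The nice part is that the mapping between integer index and player label is maintained by pandas, so the coords line up with the indices for free.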

I see you’ve uploaded your data; I’ll give this a crack later, as writing out code is likely clearer than my explanation!


Rethinking this, stacking won’t work as you need to preserve the marginal counts vs total counts for each match.

There is a kind of similarity to the baseball batting abilities example in Kruschke’s book: Doing Bayesian Data Analysis, see p. 253 (I hope this is a kosher link; I have a hardcopy at home). I'd suggest getting your head around that, in case it's useful.

I’m not at all familiar with League of Legends, but is it a reasonable modelling assumption that the proportion of kills in each match relates to a player's abilities (expressed via an “effect” from baseline)? Does the composition of the group matter, i.e. if you have 5 average players in a match together, how would the data differ from one superstar and 4 average joes? I think the upper levels of your hierarchical model may need thinking through in this respect, which could then inform what the lower levels (seen here) look like.

Interesting problem though!

I think long-form data is indeed the way to go here. I would do something like this:

import pandas as pd

cols = [f'Blue_player_{i}' for i in range(1, 6)]
new_df = None
for col in cols:
    idx_col = df[[col + '_idx']]
    kills_col = df[[col + '_kills']]
    tmp = (idx_col
           .join(kills_col)
           .set_axis(['player_id', 'kills'], axis=1)
           .reset_index()
           .rename(columns={'index': 'match_id'}))
    new_df = tmp if new_df is None else pd.concat([new_df, tmp], ignore_index=True)

match_total = new_df.groupby('match_id').kills.sum().rename('match_total').astype('int')
new_df = pd.merge(new_df, match_total, left_on='match_id', right_index=True)

This results in the following structure:

   match_id  player_id  kills  match_total
0         0         41      3            4
1         1         42      2           19
2         2         42      2            8
3         3         41      2           18
4         4         41      0           19

From here I model each player’s contribution as a Binomial (i.e. the fraction of match_total attributable to a given player), using a player effect and a match effect. Obviously you can bring hierarchy back into either or both of these, or get fancy and consider network effects between pairs of players or whatever:

match_idx, matches = pd.factorize(new_df.match_id)
player_idx, players = pd.factorize(new_df.player_id)

coords = {
    'match': matches,
    'player': players,
    'obs_idx': new_df.index
}

with pm.Model(coords=coords) as hierarchical_model:
    player_idx_pt = pm.Data("player_idx",
                                player_idx, 
                                dims=['obs_idx'])
    match_idx_pt = pm.Data("match_idx",
                                match_idx, 
                                dims=['obs_idx'])
    kills = pm.Data("kills", new_df.kills.values, dims=['obs_idx'])
    total_kills = pm.Data('total_kills', new_df.match_total.values.ravel(), dims=['obs_idx'])
    
    match_effect = pm.Normal('match_effect', mu=0, sigma=3, dims='match')
    player_effect = pm.Normal('player_effect', mu=0, sigma=3, dims='player')
    
    logit_p = match_effect[match_idx_pt] + player_effect[player_idx_pt]    
    p = pm.math.invlogit(logit_p)
    
    y_obs = pm.Binomial('y_obs', 
                        n=total_kills, 
                        p=p, 
                        observed=kills,
                        dims=['obs_idx'])
    idata = pm.sample()

Let me know if I missed the important point when I removed the Multinomial

Ok, I like the idea of the long-form data with a match ID effect. My goal is to capture the correlation between players in the same match. Ideally, the sum of the individual player kills in a sample would match the total team kills. I don't think that's guaranteed in this approach, but you could always normalize back to the total team kills.
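For instance, one way to renormalize a single posterior-predictive draw back to the team total could look like this (a sketch with made-up numbers, not part of the model above):

```python
import numpy as np

# One posterior-predictive draw of per-player kills from independent Binomials;
# their sum (here 9) need not equal the observed team total (here 12).
# Assumes at least one kill in the draw (otherwise draws.sum() is 0).
draws = np.array([3, 1, 2, 0, 3])
team_total = 12

# Rescale the proportions to the team total, then round down.
scaled = draws / draws.sum() * team_total
kills = np.floor(scaled).astype(int)

# Hand the leftover kills to the players with the largest remainders,
# so the renormalized counts sum exactly to team_total.
remainder = team_total - kills.sum()
order = np.argsort(scaled - np.floor(scaled))[::-1]
kills[order[:remainder]] += 1
```

This is a largest-remainder rounding scheme; it keeps the integer counts summing to the team total while staying close to the Binomial proportions.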

The part I’m struggling with now, in this framework, is how to make predictions for future matches. In that case you’ll have a brand-new match ID. You could sample the match effect from the prior, but that assumes you know nothing about the collection of 5 players, which I don't think is true. Maybe instead of a match ID effect you assign a unique identifier to any match that has the same collection of 5 players. I think that probably captures some of the correlation between players in a given match, though less than the strict match ID does. But it becomes more obvious how to make future predictions given your new collection of 5 players: you’ll either have an ID that matches that collection, or you could do a similarity analysis and assign the ID associated with the closest collection.
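One hedged sketch of that "same collection of players" identifier, building a roster key from a long-format frame (toy data with 3-player teams to keep it short; column names assumed from the earlier snippet):

```python
import pandas as pd

# Toy long-format data: matches 0 and 1 field the same roster, match 2 differs.
new_df = pd.DataFrame({
    'match_id': [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'player_id': [41, 42, 43, 43, 41, 42, 41, 42, 44],
})

# A roster key is the sorted tuple of player ids, so the ordering of players
# within a match doesn't matter; factorize it into an integer effect index.
roster = (new_df.groupby('match_id')['player_id']
          .apply(lambda s: tuple(sorted(s))))
new_df['roster'] = new_df['match_id'].map(roster)
new_df['team_idx'] = pd.factorize(new_df['roster'])[0]
```

A new match whose roster tuple already appeared in training gets the existing `team_idx`; an unseen roster would fall back on the population-level effect (or your nearest-collection similarity idea).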

You can get back to the multinomial setup by sorting the long data by match_id (so that every 5 rows is a team) then doing a .reshape. That is, just add:

new_df = new_df.sort_values(by='match_id')

Then you can use match_idx created by pd.factorize as a sanity check:

match_idx, matches = pd.factorize(new_df.match_id)
match_idx.reshape(-1, 5)

array([[   0,    0,    0,    0,    0],
       [   1,    1,    1,    1,    1],
       [   2,    2,    2,    2,    2],
       ...,
       [3474, 3474, 3474, 3474, 3474],
       [3475, 3475, 3475, 3475, 3475],
       [3476, 3476, 3476, 3476, 3476]])

Which is what you want. You can include or exclude match effects as you wish. Right now they are serving as a “team effect”, which captures all the player-player interactions. You probably want to model those explicitly; you can do that with e.g. a network-based model. If you keep them, you just use the match population mean for unseen matches.

Anyway we can drop the match effects and reshape logit_p to get to the setup you started with:

match_idx, matches = pd.factorize(new_df.match_id)
player_idx, players = pd.factorize(new_df.player_id)

coords = {
    'match': matches,
    'player': players,
    'position': range(5),
    'obs_idx': new_df.index
}

with pm.Model(coords=coords) as hierarchical_model:
    player_idx_pt = pm.Data("player_idx", player_idx, dims=['obs_idx'])
    kills = pm.Data("kills", new_df.kills.values.reshape(-1, 5),
                    dims=['match', 'position'])
    match_totals = pm.Data('total_kills', match_total.values, dims=['match'])

    alpha = pm.Normal('alpha')

    # Non-centered hierarchical player effect
    player_loc = pm.Normal('player_loc', mu=0, sigma=1)
    player_scale = pm.Gamma('player_scale', alpha=2, beta=1)
    player_offset = pm.ZeroSumNormal('player_offset', dims=['player'])
    player_effect = pm.Deterministic('player_effect',
                                     player_loc + player_scale * player_offset,
                                     dims=['player'])

    logit_p = alpha + player_effect[player_idx_pt]
    p = pm.math.softmax(logit_p.reshape((-1, 5)), axis=-1)

    y_obs = pm.Multinomial('y_obs',
                           n=match_totals,
                           p=p,
                           observed=kills,
                           dims=['match', 'position'])

    player_result = pm.Deterministic('player_result', y_obs.ravel(), dims=['obs_idx'])

    idata = pm.sample(init='jitter+adapt_diag_grad')

position is a dummy index, since we don’t actually care about the ordering of the players on the roster within each match (unless you do, in which case you can just encode that as a feature column). The player_result deterministic is included for convenience. You can add a multi-index to the posterior to make it easier to work with:

import xarray as xr
idx = xr.Coordinates.from_pandas_multiindex(new_df.set_index(['match_id', 'player_id']).index, 'obs_idx')
idata.posterior = idata.posterior.assign_coords(idx)

Then query the outcome for different players. You’ll have to decide what to do about the fact that you get predictions out on a match-wise basis. For example, we can compute the expected number of kills for player_id = 123 across all his matches:


import arviz as az

with hierarchical_model:
    idata = pm.sample_posterior_predictive(idata)

idata.posterior_predictive = idata.posterior_predictive.assign_coords(idx)
az.plot_posterior(idata.posterior_predictive.player_result.sel(player_id=123).mean(dim='match_id'))


Perfect, love it. Thanks for all your help :+1: