Model compilation difficulties with medium-size dataset


#1

Hi,

I am trying to estimate the true skill level of a collection of teams in an ultimate frisbee league. There are about 70 teams in the league and each team plays between 10 and 20 games. My dataset is a record of the participants in each game (Team A and Team B) and the outcome of that game (“Win” or “Loss” for Team A, neglect ties for now). There are a total of 500 or so games in the database.

I have a modeling approach that works well when the number of teams is around 20 or less and the remaining number of total games is around 200 or less. Unfortunately, when I scale this up to the full dataset with ~70 teams and ~500 games, I get an error message that sounds a lot like the one discussed in Issue #955: https://github.com/pymc-devs/pymc3/issues/955. I’m relatively new to PyMC3 and am wondering if there is something about how I’ve specified my model that is inefficient and could be better.

As an example, I am including a csv file with game results for an artificial league with 8 teams that each play around 15 games.

artificial_scores.csv (1.5 KB)

Here is my model. The prior on the true skill level for each team is a Normal distribution with a mean of zero and a standard deviation of 1. I feed the difference in skill levels into a logistic function to calculate the probability that “Team A” wins. The likelihood is a Bernoulli distribution, with the observed values being a numpy array of Booleans indicating whether or not “Team A” won.

import pandas as pd
import theano.tensor as tt
import pymc3 as pm

game_results = pd.read_csv('artificial_scores.csv')
n_teams = 8

with pm.Model() as model:

    # the prior on the true skill level for each team
    skill = pm.Normal('skill', mu=0, sd=1, shape=n_teams)

    # Is there a better way to construct the model here?
    B_minus_A = []
    for row in game_results.index:
        team_A = game_results.loc[row, 'Team A']
        team_B = game_results.loc[row, 'Team B']
        B_minus_A.append(skill[team_B] - skill[team_A])
        
    # Avoid 0 or 1
    lower = 1e-6
    upper = 1 - 1e-6
    probability_A_beats_B = lower + (upper - lower) * 1 / (1 + tt.exp(B_minus_A))
    
    # likelihood
    observation = pm.Bernoulli('observation', probability_A_beats_B, observed=game_results['Winner is A'].values)

with model:
    trace = pm.sample(1000)

For this particular dataset, I get the following results for the true skill levels of each team:

True: -1.96; Estimated: -1.98
True: 1.07; Estimated: 0.68
True: -0.33; Estimated: -0.40
True: -2.82; Estimated: -2.63
True: 0.19; Estimated: 0.11
True: 0.72; Estimated: 0.23
True: 1.55; Estimated: 1.71
True: 2.50; Estimated: 2.39

Not bad. I would really like to be able to get this working with ~70 teams and ~500 games. Any help is appreciated!

Thanks,
Shane Bussmann


#2

The problem is that you are creating a new theano variable for each row in your dataframe. You can actually just index skill with the array of integers:

# instead of the for loop
B_minus_A = skill[game_results['Team A'].values] - skill[game_results['Team B'].values]
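To see why this is equivalent, here is a small NumPy sketch (toy skill values and game indices, not the real data) showing that a single fancy-indexing gather matches the per-row Python loop:

```python
import numpy as np

# Toy skill vector and per-game team indices (hypothetical values)
skill = np.array([0.5, -1.0, 2.0])
team_A = np.array([0, 2, 1])
team_B = np.array([1, 0, 2])

# One vectorized gather replaces the per-row Python loop
vectorized = skill[team_B] - skill[team_A]
looped = np.array([skill[b] - skill[a] for a, b in zip(team_A, team_B)])
print(np.array_equal(vectorized, looped))  # True
```

With theano the payoff is even bigger: the indexed version adds a constant number of nodes to the graph, instead of one subtraction node per game.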

But shouldn’t the skills have a scale parameter, something like this:

team1 = game_results['Team A'].values
team2 = game_results['Team B'].values
with pm.Model() as model:
    skill_sd = pm.HalfStudentT('skill_sd', sd=2.5, nu=3)
    skill = pm.Normal('skill', shape=n_teams)
    
    logit_p = skill_sd * (skill[team1] - skill[team2])
    p = tt.nnet.sigmoid(logit_p)
    pm.Bernoulli('win', p, observed=game_results['Winner is A'].values)

#3

Also, your model is simple enough to just fit with ADVI; on a large dataset it could be faster:

with model:
    approx = pm.fit()
    trace = approx.sample(1000)

#4

Thanks @aseyboldt, the suggestion to index skill with the array of integers made all the difference for me! I am now getting really good results out of this process.

Here is real, anonymized data for one league: anon_scores.csv (17.8 KB)

There is an additional wrinkle compared to the artificial dataset I uploaded previously. Each team is associated with a “division”, or level of competition. Teams at the division 1 level are the strongest, teams at the division 2 level are average, and teams at the division 3 level are the weakest. I am accounting for this by setting mu in the Normal prior for team skill to be equal to 3.0 for division 1 teams, 0.0 for division 2 teams, and -1.0 for division 3 teams. The idea is that a division 1 team should beat a division 2 team ~95% of the time and a division 2 team should beat a division 3 team ~75% of the time. The exact percentage depends on scale, which I am now including in the model per @aseyboldt’s suggestion. It was in my original model, but I had taken it out in an effort to improve its slow run time.
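As a quick check on those percentages (a sketch that assumes skill_sd is about 1, so the logit is simply the difference in prior means):

```python
import math

def sigmoid(x):
    # logistic function: maps a skill difference to a win probability
    return 1.0 / (1.0 + math.exp(-x))

# With skill_sd = 1, the prior means imply:
p_d1_beats_d2 = sigmoid(3.0 - 0.0)     # Div 1 vs Div 2
p_d2_beats_d3 = sigmoid(0.0 - (-1.0))  # Div 2 vs Div 3
print(round(p_d1_beats_d2, 2))  # 0.95
print(round(p_d2_beats_d3, 2))  # 0.73
```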

Here is the full code I’m using to model the data:

import numpy as np
import pandas as pd
import theano.tensor as tt
import pymc3 as pm

anon_scores = pd.read_csv('anon_scores.csv', index_col='index')
team_list_A = anon_scores['Index A'].unique()
team_list_B = anon_scores['Index B'].unique()
team_list = set(np.append(team_list_A, team_list_B))
n_teams = len(team_list)

skill_prior_div = {
    '4/3 Div 1': 3.0,
    '4/3 Div 2': 0.0,
    '4/3 Div 3': -1.0,
    '5/2 Div 1': 3.0,
    '5/2 Div 2': 0.0,
    '5/2 Div 3': -1.0
}

# alphas define `mu` in the prior on `skill`
alphas = []
for i in range(n_teams):
    if i in anon_scores['Index A'].values:
        index = anon_scores['Index A'] == i
        div = anon_scores.loc[index, 'Div A'].unique()[0]
        alpha = skill_prior_div[div]
        alphas.append(alpha)
    else:
        index = anon_scores['Index B'] == i
        div = anon_scores.loc[index, 'Div B'].unique()[0]
        alpha = skill_prior_div[div]
        alphas.append(alpha)

team1 = anon_scores['Index A'].values
team2 = anon_scores['Index B'].values

with pm.Model() as model:
    skill_sd = pm.HalfStudentT('skill_sd', sd=2.5, nu=3)
    skill = pm.Normal('skill', mu=alphas, shape=n_teams)
    
    logit_p = skill_sd * (skill[team1] - skill[team2])
    p = tt.nnet.sigmoid(logit_p)
    win = pm.Bernoulli('win', p, observed=anon_scores['Team A Wins'].values)

with model:
    trace = pm.sample(1000)
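As an aside, the alphas loop above can also be expressed without iterating team by team. Here is a sketch using a tiny made-up frame with the same column names (the data values are hypothetical, just to show the shape of the idea):

```python
import pandas as pd

# Hypothetical mini-frame mirroring the columns used above
anon_scores = pd.DataFrame({
    'Index A': [0, 1, 0],
    'Div A':   ['4/3 Div 1', '4/3 Div 2', '4/3 Div 1'],
    'Index B': [1, 2, 2],
    'Div B':   ['4/3 Div 2', '4/3 Div 3', '4/3 Div 3'],
})
skill_prior_div = {'4/3 Div 1': 3.0, '4/3 Div 2': 0.0, '4/3 Div 3': -1.0}

# Stack the (team index, division) pairs from both columns,
# keep one row per team, and map division -> prior mean
teams = pd.concat([
    anon_scores[['Index A', 'Div A']].rename(columns={'Index A': 'team', 'Div A': 'div'}),
    anon_scores[['Index B', 'Div B']].rename(columns={'Index B': 'team', 'Div B': 'div'}),
]).drop_duplicates('team').sort_values('team')
alphas = teams['div'].map(skill_prior_div).values
print(alphas)  # [ 3.  0. -1.]
```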

Here are the average posterior skill levels for the top 10 teams, ranked according to an existing, heuristic-based approach that does not take strength of schedule into account:

original_rank bayesian_skill
1: 4.392
2: 3.002
3: 2.840
4: 2.135
5: 3.128
6: 1.956
7: 1.790
8: 1.537
9: 1.160
10: 1.390
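Posterior means like the ones in this table come from averaging the skill samples over draws; a minimal sketch, with a stand-in array in place of trace['skill']:

```python
import numpy as np

# Hypothetical stand-in for trace['skill']: shape (n_draws, n_teams)
skill_samples = np.array([[1.0, -0.5, 1.5],
                          [1.0,  0.5, 1.5]])

mean_skill = skill_samples.mean(axis=0)  # per-team posterior mean
ranking = np.argsort(mean_skill)[::-1]   # team indices, best first
print(mean_skill)  # [1.  0.  1.5]
print(ranking)     # [2 0 1]
```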

The posterior PDF for Team 5 suggests that it is actually the second best team in the league. Looking at the schedule, it is clear that Team 5 had a more difficult schedule than Teams 2, 3, and 4. In fact, Team 5 went 1-1 against Team 2, and 1-0 against Team 3 and Team 4.

This occurs throughout the list of teams: when there is a significant disagreement between the old, heuristics-based approach and this Bayesian approach, the Bayesian one wins. Very cool!

Thanks for the help on this, I really appreciate it!

P.S. @junpenglao, how do you make a determination that a particular model is simple enough to just fit with ADVI?


#5

Really cool analysis and insight - you should write it into a case study or blog post!

You can fit very complicated models with ADVI as well. What I meant is that with a simple model (little hierarchy) it is easier to validate the fit from the VI approximation, so you can trust the fit.