I am trying to estimate the true skill level of a collection of teams in an ultimate frisbee league. There are about 70 teams in the league and each team plays between 10 and 20 games. My dataset is a record of the participants in each game (Team A and Team B) and the outcome of that game (“Win” or “Loss” for Team A, neglect ties for now). There are a total of 500 or so games in the database.
I have a modeling approach that works well when the number of teams is around 20 or fewer and the total number of games is around 200 or fewer. Unfortunately, when I scale this up to the full dataset with ~70 teams and ~500 games, I get an error message that sounds a lot like the one discussed in Issue #955: https://github.com/pymc-devs/pymc3/issues/955. I'm relatively new to PyMC3 and am wondering whether there is something inefficient about how I've specified my model that could be done better.
As an example, I am including a csv file with game results for an artificial league with 8 teams that each play around 15 games.
artificial_scores.csv (1.5 KB)
Here is my model. I estimate the true skill level for each team initially as a Normal distribution with a mean of zero and a standard deviation of 1. I feed the difference in skill level into a logistic function to calculate the probability that “Team A” wins. The likelihood is a Bernoulli distribution with the observed values being a numpy array of Booleans indicating whether or not “Team A” won.
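To make the win-probability mapping concrete, here is the logistic function applied to a skill difference, in plain Python (just an illustration of the math, not part of the model itself):

```python
import math

def p_a_wins(skill_a, skill_b):
    """Logistic of the skill difference: P(Team A beats Team B)."""
    return 1.0 / (1.0 + math.exp(skill_b - skill_a))

# Equal skill gives a 50/50 game; a skill advantage pushes P above 0.5
print(p_a_wins(1.0, 1.0))  # 0.5
print(p_a_wins(2.0, 0.0) > 0.5)  # True
```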
```python
import pandas as pd
import theano.tensor as tt
import pymc3 as pm

game_results = pd.read_csv('artificial_scores.csv')
n_teams = 8

with pm.Model() as model:
    # the prior on the true skill level for each team
    skill = pm.Normal('skill', mu=0, sd=1, shape=n_teams)

    # Is there a better way to construct the model here?
    B_minus_A = []
    for row in game_results.index:
        team_A = game_results.loc[row, 'Team A']
        team_B = game_results.loc[row, 'Team B']
        B_minus_A.append(skill[team_B] - skill[team_A])

    # Avoid probabilities of exactly 0 or 1
    lower = 1e-6
    upper = 1 - 1e-6
    probability_A_beats_B = lower + (upper - lower) * 1 / (1 + tt.exp(B_minus_A))

    # likelihood
    observation = pm.Bernoulli('observation', probability_A_beats_B,
                               observed=game_results['Winner is A'].values)

with model:
    trace = pm.sample(1000)
```
For this particular dataset, I get the following results for the true skill levels of each team:
```
True: -1.96; Estimated: -1.98
True:  1.07; Estimated:  0.68
True: -0.33; Estimated: -0.40
True: -2.82; Estimated: -2.63
True:  0.19; Estimated:  0.11
True:  0.72; Estimated:  0.23
True:  1.55; Estimated:  1.71
True:  2.50; Estimated:  2.39
```
Not bad. I'd really like to get this working with ~70 teams and ~500 games. Any help is appreciated!