I am trying to estimate the true skill level of a collection of teams in an ultimate frisbee league. There are about 70 teams in the league and each team plays between 10 and 20 games. My dataset is a record of the participants in each game (Team A and Team B) and the outcome of that game (“Win” or “Loss” for Team A, neglect ties for now). There are a total of 500 or so games in the database.
I have a modeling approach that works well when the number of teams is around 20 or fewer and the total number of games is around 200 or fewer. Unfortunately, when I scale this up to the full dataset with ~70 teams and ~500 games, I get an error message that sounds a lot like the one discussed in Issue #955: https://github.com/pymc-devs/pymc3/issues/955. I'm relatively new to PyMC3 and am wondering whether there is something inefficient about how I've specified my model that could be done better.
As an example, I am including a csv file with game results for an artificial league with 8 teams that each play around 15 games.
artificial_scores.csv (1.5 KB)
Here is my model. I estimate the true skill level for each team initially as a Normal distribution with a mean of zero and a standard deviation of 1. I feed the difference in skill level into a logistic function to calculate the probability that “Team A” wins. The likelihood is a Bernoulli distribution with the observed values being a numpy array of Booleans indicating whether or not “Team A” won.
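To make the win-probability mapping concrete, here is the logistic function applied to a skill difference, in plain Python (just an illustration of the math, not part of the model itself):

```python
import math

def p_a_wins(skill_a, skill_b):
    """Logistic of the skill difference: P(Team A beats Team B)."""
    return 1.0 / (1.0 + math.exp(skill_b - skill_a))

# Equal skill gives a 50/50 game; a skill advantage pushes P above 0.5
print(p_a_wins(1.0, 1.0))  # 0.5
print(p_a_wins(2.0, 0.0) > 0.5)  # True
```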
```python
import pandas as pd
import theano.tensor as tt
import pymc3 as pm

game_results = pd.read_csv('artificial_scores.csv')
n_teams = 8

with pm.Model() as model:
    # the prior on the true skill level for each team
    skill = pm.Normal('skill', mu=0, sd=1, shape=n_teams)

    # Is there a better way to construct the model here?
    B_minus_A = []
    for row in game_results.index:
        team_A = game_results.loc[row, 'Team A']
        team_B = game_results.loc[row, 'Team B']
        B_minus_A.append(skill[team_B] - skill[team_A])

    # Avoid probabilities of exactly 0 or 1
    lower = 1e-6
    upper = 1 - 1e-6
    probability_A_beats_B = lower + (upper - lower) * 1 / (1 + tt.exp(B_minus_A))

    # likelihood
    observation = pm.Bernoulli('observation', probability_A_beats_B,
                               observed=game_results['Winner is A'].values)

with model:
    trace = pm.sample(1000)
```
For this particular dataset, I get the following results for the true skill levels of each team:
```
True: -1.96; Estimated: -1.98
True:  1.07; Estimated:  0.68
True: -0.33; Estimated: -0.40
True: -2.82; Estimated: -2.63
True:  0.19; Estimated:  0.11
True:  0.72; Estimated:  0.23
True:  1.55; Estimated:  1.71
True:  2.50; Estimated:  2.39
```
Not bad. I'd really like to get this working with ~70 teams and ~500 games. Any help is appreciated!