Tournament Skill Estimator, some modelling challenges

Hi all,

I’ve been trying to get some pymc3 models to work for my work & hobbies.
Currently, I’m trying to make a skill estimator for a board game (fyi: Terraforming Mars), and I’m running into problems with convergence (and “anchoring”) as well as with some modelling choices.

Unlike in most games, we cannot simply rely on the total points gathered by players, as the duration varies a lot from game to game and is not necessarily related to players’ skill levels.

The tournament will contain 3 rounds, with each match holding 3 to 4 players, and I intend to rely on the players’ pairwise performance differences.

E.g., a match has 4 players (A,B,C,D) with scores: 54 63 65 49. That means pairwise performance differences are: A-B = -9, A-C = -11, A-D = 5, B-C = -2, B-D = 14, C-D=16.
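
As a quick sanity check, the differences above can be reproduced with a few lines of Python. This is only a sketch; the dictionary `scores` is an illustrative container for the example match and is not part of my actual code.

```python
from itertools import combinations

# Scores of the example match (players A, B, C, D).
scores = {"A": 54, "B": 63, "C": 65, "D": 49}

# One difference per unordered pair, first player minus second player.
pairwise = {
    f"{p}-{q}": scores[p] - scores[q]
    for p, q in combinations(scores, 2)
}
print(pairwise)
```

A 4-player table yields 6 pairs and a 3-player table yields 3.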

My current approach is to model each player, their performance and difference as follows:

skill_i \sim Normal(0, 10), \quad \forall i \in Players \\
performance_{r,i} \sim Normal(skill_i, 5), \quad \forall r \in Rounds,\ i \in Players \\
difference_{i,j} \sim Normal(performance_{r,i} - performance_{r,j}, 10), \quad \forall i, j \in match_r

The associated code, with the performances for just two rounds and the assignment of who is matched to whom:

import numpy as np

realised_performance = np.array([[[68, 61, 62, 69],
        [65, 85, 95, 72],
        [80, 69, 71, 59],
        [70, 74, 55, 68],
        [61, 79, 60, 63],
        [81, 66, 73, 73],
        [70, 98, 78, 81],
        [66, 49, 61, 54]],
       [[58, 92, 80, 81],
        [56, 74, 63, 66],
        [86, 77, 92, 86],
        [67, 64, 54, 56],
        [72, 74, 57, 81],
        [68, 67, 69, 54],
        [74, 82, 93, 72],
        [59, 75, 91, 81]]])

roundassignment = np.array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23],
        [24, 25, 26, 27],
        [28, 29, 30, 31]],
       [[ 1, 29,  4, 24],
        [21, 18, 14, 11],
        [15, 26, 22, 16],
        [31,  9,  2,  7],
        [ 0, 10, 30, 19],
        [27, 12, 23,  5],
        [28,  3, 13, 20],
        [ 8, 17, 25,  6]]])

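player = []   # index of the first player in each pair
oppo = []     # index of the second player in each pair
obsdiff = []  # observed score difference: first minus second
for round_idx, round_performance in enumerate(realised_performance):
    for table, match in enumerate(round_performance):
        # Player ids seated at this table in this round
        players_table = roundassignment[round_idx][table]
        for i, player_in_table in enumerate(players_table):
            for j, oppo_in_table in enumerate(players_table):
                if i >= j:
                    continue  # keep each unordered pair only once
                player.append(player_in_table)
                oppo.append(oppo_in_table)
                obsdiff.append(match[i] - match[j])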

Then, in pymc3 I modelled it as follows:

import pymc3 as pm

with pm.Model() as model_all:
    skill = pm.Normal('skill', 0, 10, shape=(32,1))
    performance = pm.Normal('performance', skill, sd=5, shape=(32, 1))
    rel_dif = performance[player] - performance[oppo]
    dif = pm.Normal('dif', rel_dif, sd=10, observed=obsdiff)
    trace = pm.sample(1000, tune=1000)

I have tried various options, but I often get acceptance and Rhat warnings. The estimated values also appear to be very close to each other (all 5 with a std of 0.01, or all 10 with the same std). I was hoping to see skill values that correspond to ‘this player usually has 5 more points than that player’.

I realize that there are a few oddities in my modelling choices: the performance and differences are supposed to be integers, but I currently model them as continuous variables. Likewise, I do believe that modelling the difference as a Normal distribution might also not be the ideal option.

I could definitely use some pointers for:

  1. How to stabilize the convergence? Should I try to anchor the worst player for example?
  2. Modelling choices regarding difference


If you plot the performance differences, do they look somewhat Gaussian?

Also, I would first try restricting the prior to Normal(0, 1). You need strong regularization of your parameters, because otherwise the model is unidentified (i.e., the same performance differences can come from different latent abilities).

You can have a look at the discussion around rating model here as well:

Thank you for your reply;

I just tried your first suggestion, but even with very strong regularization (Normal(0,1)) my results appear to find different latent abilities, which just so happen to be around the same value.

The performance differences indeed approximate a Gaussian.

I have since quickly written an MLE approach, which gives good results quickly: players have a skill of 10, 20, 5, or -10 and so on. I was hoping that the skill prior of Normal(0, 10) would allow such values to be found, considering that players differ by 0 to 30 points from each other. Is there any way that I can anchor one of the parameters and have all other parameters adapt to that one, or some other way to address the scattered latent ability? I believe I had a similar issue before with a different topic in pymc3.

FYI, your search link simply returns everything with either the word ‘rate’ or ‘model’ and “rate model” only returns this very topic. :slight_smile:

I think so. The trick is to model N-1 players, with the Nth player fixed at a constant value (e.g., 0).
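
A minimal sketch of that idea, using plain NumPy least squares on the example match from the first post rather than PyMC3. The design matrix `X` and the variable names are my own, for illustration only, and this is not meant as a replacement for your model.

```python
from itertools import combinations
import numpy as np

scores = np.array([54, 63, 65, 49])  # players A, B, C, D from the example match
pairs = list(combinations(range(len(scores)), 2))
diffs = np.array([scores[i] - scores[j] for i, j in pairs])

# Design matrix: +1 for the first player of a pair, -1 for the second.
X = np.zeros((len(pairs), len(scores)))
for row, (i, j) in enumerate(pairs):
    X[row, i] = 1.0
    X[row, j] = -1.0

# Adding a constant to every skill leaves X @ skill unchanged,
# so X is rank deficient by one: the location is not identified.
rank_full = np.linalg.matrix_rank(X)

# Anchor: drop the last player's column, which fixes that skill at 0.
free, *_ = np.linalg.lstsq(X[:, :-1], diffs, rcond=None)
skill = np.append(free, 0.0)
print(rank_full, skill)
```

With the last player pinned at 0, the remaining skills read directly as "points more than that player".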

If you search rating model there are a few posts.

I have tried anchoring as well as tried a different way of parameterizing the prior; neither helped the situation. I am still a bit dumbfounded how I can fix the multi-modality situation.
I anchored the first player such that it has to be extremely close to 0:

anchor = pm.Normal('anchor', skill[0], sd=0.001, observed=0)

I have searched for your ‘rating model’ suggestion. Searching for the exact phrase “rating model” returns only this thread. Searching the words separately does not help either: ‘model’ returns all kinds of model comparisons, which aren’t the same as a rating model, and ‘rating’ returns posts that simply estimate rates. Nothing actually has to do with comparisons between rates. The only things I found were:

  • Thurstone Comparative rankings and 2D shape-parameters
    which has some similarities in topic but approaches it differently, in a way that unfortunately isn’t applicable to my situation.
  • although very interesting and closely related, they approach the scoring from total points rather than the difference. In my situation I argue that modelling the total points isn’t very plausible, as games have varying length, unlike Rugby.

Hence, I am not sure which posts you are referring to.

Did you have a look at:


They are both rating models.

Thanks a lot for those links.

They really did not show up for me; I found more using google search.
I may try to follow the first one, albeit his modelling is substantially different.

Do you have any recommendations regarding multi-modality? Perhaps search terms?

I think in your case it is not multi-modality; rather, the model is unidentifiable. I would imagine that anchoring the score of the first player should work. What does the trace plot look like?
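
It may also be worth ruling out a shape issue, though I have not run your model, so treat this as a guess. With shape=(32, 1), indexing `performance` by a list of player ids gives a column of shape (n_pairs, 1), while `obsdiff` is a flat vector. Under NumPy-style broadcasting, combining the two gives an n_pairs × n_pairs grid instead of an elementwise comparison. The sketch below only demonstrates this in NumPy. The index arrays are placeholders, not your real pairings, and 96 assumes 2 rounds × 8 tables × 6 pairs.

```python
import numpy as np

n_players, n_pairs = 32, 96
performance = np.zeros((n_players, 1))       # same shape as shape=(32, 1)
player = np.arange(n_pairs) % n_players      # placeholder indices
oppo = (np.arange(n_pairs) + 1) % n_players  # placeholder indices
obsdiff = np.zeros(n_pairs)                  # placeholder observations

rel_dif = performance[player] - performance[oppo]
print(rel_dif.shape)                    # (96, 1)
print((obsdiff - rel_dif).shape)        # (96, 96): every obs vs every pair
print((obsdiff - rel_dif[:, 0]).shape)  # (96,): elementwise, as intended
```

If this applies to your model, declaring the variables with shape=32 instead of shape=(32, 1) would keep everything one-dimensional.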