Help with Hierarchical Model Example

Hi all,

I am working my way through the tutorials and examples on the PyMC3 site, and I’m looking to get some feedback regarding my understanding of what is happening in the Rugby Example.

At this stage I am looking to understand more closely what the Python code is doing, rather than the maths behind it, as I am still building up my math skills.

Pasting the model here for easy reference:

import pymc3 as pm
import theano.tensor as tt

# num_teams, home_team, away_team, observed_home_goals and observed_away_goals
# come from the notebook's data-loading steps
with pm.Model() as model:
    # global model parameters
    home = pm.Flat("home")
    sd_att = pm.HalfStudentT("sd_att", nu=3, sigma=2.5)
    sd_def = pm.HalfStudentT("sd_def", nu=3, sigma=2.5)
    intercept = pm.Flat("intercept")

    # team-specific model parameters
    atts_star = pm.Normal("atts_star", mu=0, sigma=sd_att, shape=num_teams)
    defs_star = pm.Normal("defs_star", mu=0, sigma=sd_def, shape=num_teams)

    atts = pm.Deterministic("atts", atts_star - tt.mean(atts_star))
    defs = pm.Deterministic("defs", defs_star - tt.mean(defs_star))
    home_theta = tt.exp(intercept + home + atts[home_team] + defs[away_team])
    away_theta = tt.exp(intercept + atts[away_team] + defs[home_team])

    # likelihood of observed data
    home_points = pm.Poisson("home_points", mu=home_theta, observed=observed_home_goals)
    away_points = pm.Poisson("away_points", mu=away_theta, observed=observed_away_goals)

Let me try to explain what I understand; please tell me if it's right or badly wrong.

The shape parameter on atts_star means that for each MCMC trace, 6 values are drawn from this normal distribution (and stored in an array).

The Theano vector math here on the atts deterministic variable is just a little bit out of my reach at the moment, but what I think is happening is:

the atts_star array has the mean of the atts_star distribution object subtracted from each value. That doesn't seem right, though.

Referencing atts[home_team] in home_theta means that it then does the above for ALL items in the home_team array. So it would do the above 60 times (the length of the home_team array is 60), meaning a total of 360 draws from the atts_star normal distribution for each trace? (length of the home_team array × num_teams, 60 × 6).

I hope this makes sense! Looking for any help to steer me in the right direction. Thanks in advance.

The shape=num_teams argument that is passed to pm.Normal() is specifying how many parameters you want. Here, you want 1 for each team. Each of these parameters will have a normal prior located at zero (mu=0) and an attack-specific scale parameter that is shared among all teams (sigma=sd_att). This line is roughly equivalent to this:

# for team 1
atts_star1 = pm.Normal("atts_star1", mu=0, sigma=sd_att)
# for team 2
atts_star2 = pm.Normal("atts_star2", mu=0, sigma=sd_att)
# for team 3
atts_star3 = pm.Normal("atts_star3", mu=0, sigma=sd_att)
...
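You can see that shape on the model object itself; model.test_point is just PyMC3's dictionary of default starting values, with one entry per free parameter:

model.test_point["atts_star"].shape  # (6,) -- one value per team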

The number of samples we request is only specified once we start sampling. Below, we call pm.sample(1000) which will draw 1000 samples. Each sample will consist of 1 value for each of the parameters in our model.
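For reference, that call looks like this, run inside the model context (with PyMC3's defaults several chains are drawn, and the resulting trace can be indexed by parameter name):

with model:
    trace = pm.sample(1000)

trace["home"].shape       # e.g. (4000,) if 4 chains of 1000 draws each ran
trace["atts_star"].shape  # e.g. (4000, 6): one array of 6 values per draw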

It looks like the 6 atts parameters are being centered (the array mean is subtracted) so that they capture relative strength. This makes the model simpler because we don't have to specify that each team's tendency to score is a combination of both their own attacking/defensive strength and their opponent's. The relative comparison is already baked into atts and defs (but not atts_star and defs_star).
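A quick NumPy analogue of that centering (the numbers are made up): after the mean is subtracted, the values sum to zero, so each entry reads as attacking strength relative to the average team.

import numpy as np

atts_star = np.array([0.4, 0.1, -0.2, 0.3, -0.5, 0.2])
atts = atts_star - atts_star.mean()
print(atts)        # [ 0.35  0.05 -0.25  0.25 -0.55  0.15]
print(atts.sum())  # 0.0 (up to floating-point error)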

The home_theta isn't a parameter. It doesn't, for example, have any prior specified. It's just a variable that combines several of the parameters we have constructed above in some sensible way. This line says that we expect a team's tendency to score at home to be a combination of their own (relative) attacking strength, their opponent's (relative) defensive strength, and some home advantage. You are correct that this variable contains a value for all teams/games. The format/shape will be relevant because we will eventually want to connect this variable to the observed goal data (which we do in the two pm.Poisson lines that include the observed arguments). But as mentioned above, we aren't sampling yet. We're just specifying how all the pieces fit together.
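To see how that indexing produces one value per game, here is a plain-NumPy analogue (the team indices are invented for illustration; in the real data home_team has length 60, so home_theta ends up with one entry per game):

import numpy as np

atts = np.array([0.35, 0.05, -0.25, 0.25, -0.55, 0.15])  # one value per team
home_team = np.array([0, 2, 4, 1])                       # home side's team index per game

print(atts[home_team])  # [ 0.35 -0.25 -0.55  0.05] -- one value per game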

Okay great, I’m understanding that bit correctly at least, though my explanation may have been a bit vague.

Okay, so if I were to continue using your example above for the atts variable, it would work something like:

# for team 1
atts1 = pm.Deterministic("atts1", atts_star1 - tt.mean(atts_star1))
# for team 2
atts2 = pm.Deterministic("atts2", atts_star2 - tt.mean(atts_star2))
# for team 3
atts3 = pm.Deterministic("atts3", atts_star3 - tt.mean(atts_star3))
...

I think I'm still falling over at the tt.mean function. The Theano doco states that it is just a simple mean calculation, but the input is a

symbolic tensor (or compatible)

What information is atts_star passing to this function? Is it simply the mean of the atts_star normal distribution at that point in the sampling?

Okay perfect, so let's see if I can understand what happens when we hit the magic inference button (not the actual MCMC process itself, just a sort of pseudo-code version):

  1. Global model parameters are initialized based on the inputs
  2. atts_star and defs_star are initialized as a scaffold; the shape parameter of num_teams means that within these parameters 6 normal distributions are initialized, and they will share a sd of sd_att and sd_def respectively. The output of these will be an array with a length of 6?
  3. atts and defs deterministic variables are initialized as a scaffold and told to reference the atts_star and defs_star distributions.
  4. home_theta and away_theta are initialized as a scaffold and told to reference the global intercept parameter (and home parameter for home_theta), as well as the atts variable indexed by home_team, which is an external variable. Repeat for the others.
  5. home_points and away_points parameters are initialized to tie it all together and are given our observed data
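If I've got those five steps right, I'd expect to see that scaffolding on the model object itself; something like this (my guess at the relevant PyMC3 attributes):

model.free_RVs        # home, intercept, atts_star, defs_star, plus the scales in _log__ form
model.deterministics  # includes atts and defs
model.observed_RVs    # home_points and away_points, tied to the data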

Now that all of the scaffolding is complete, the sampler is ready to take its first sample. This is how I imagine it working: the sampler starts from the bottom and works its way up through the model until it has all the values it needs:

  1. home_points needs a value for mu from home_theta, so it goes there to get one.
  2. home_theta needs values from intercept, home, atts[0] and defs[0]. It also knows its shape is 60, so it needs to go and get a value from the listed parameters for each item in home_team and away_team. It now goes to atts for the first value it needs.
  3. atts now knows it has a shape of 6 for the number of teams; it also needs a value for sd, so it reaches out to sd_att to get one.
  4. sd_att sends back a value.
  5. atts_star then uses that value and makes 6 normal distributions using the sd value sent back.
  6. atts then receives back an array of 6 values from atts_star. It then performs the calculation of atts_star minus the mean of the 6 returned values (maybe?).
  7. home_theta gets its values from intercept, home, atts and defs 60 times. It then applies an exponential function to the entire result and returns it to home_points.

I think I've mixed up some things there, and it's likely to be very hard to read, but I'd appreciate it if you could tell me how close I am to understanding that process. Thanks mate!

This doesn't quite work, because the mean in the original is being taken over the whole 6-parameter array. So to unpack the line you would have something awkward like this:

# for team 1
atts1 = pm.Deterministic("atts1", atts_star1 -
                                  ((atts_star1 +
                                    atts_star2 +
                                    atts_star3 +
                                    atts_star4 +
                                    atts_star5 +
                                    atts_star6) / 6))

Obviously this isn’t very pretty, but I think it gets the job done.

All that really happens during sampling is that the sampler hands a set of parameter values to the model as input and receives a probability back as output. The input will include a value for every parameter in the model: home, sd_att, sd_def, intercept, atts_star[0], atts_star[1], etc. (but not atts, defs, home_theta, or away_theta; those can be calculated from the parameters). The probability that comes back is (loosely speaking) the posterior probability associated with those parameter values. Based on the probabilities returned by the model, the sampler walks around the parameter space, trying new parameter values as it moves.
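You can mimic one such handoff yourself; a sketch using the PyMC3 v3 API (model.test_point is just a convenient default point, but any dict with the same keys would do):

point = model.test_point  # one value per free parameter, keyed by name
print(model.logp(point))  # the (unnormalized) log posterior at that point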

One of the less intuitive things about many probabilistic programming languages is that the models tend to be very hollow: none of the parameters/variables have any values at the time the model is built. Those values are only supplied later on, and at that point the model is treated as a pretty boring input-output machine/function/black box.
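Theano itself works the same way, which also bears on the earlier tt.mean question: tt.mean(atts_star) computes nothing when the model is built, it just adds a mean node to the symbolic graph. A minimal standalone example:

import theano
import theano.tensor as tt

x = tt.vector("x")           # a symbolic vector; no values attached yet
m = tt.mean(x)               # a symbolic expression; still no values
f = theano.function([x], m)  # compile the graph into a callable
print(f([1.0, 2.0, 3.0]))    # 2.0 -- values only flow through at call time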