Fantasy Football Hierarchical Model Questions

Hi all, I’m working on adapting the rugby hierarchical model example for (American) fantasy football.

My intention is the following (and I think I successfully did this in my code below? Running locally I get almost no divergences):

  • Goal: for a specified position (ex: only quarterbacks), get latent variables describing a player’s ability to score points
  • Background: in fantasy football, you draft a roster of players that score points each week depending on how they play in the actual football game. Specifically, prior to the season, players are attributed overall projected points (float dtype) that encompass the entire season. My intention is to use these season projected points, coupled with inference based on the season schedule, to ultimately get per-game projected points.
  • Method: Like the rugby example, I am using a log-linear random effect model. For a given player in a given game, log\theta_{player, game} = player_{player} + off_{player team} + def_{opponent team}. Then, for the entire season, \theta_{player}=\sum_{game} \theta_{player, game}
    • player_{player} represents an individual player’s contribution; it is supposed to capture effects not related to the broader teams that are playing
    • off_{player team} represents the rest of the player’s team’s (they’re on offense) overall contributions; ex: a quarterback earns points only if the player they pass to catches it, a running back only earns points if the team blocks the opponent, etc.
    • def_{opponent team} represents the opposing team’s (they’re on defense) overall contributions; ex: a great running defense makes it very difficult for a running back to get points
  • Notes: points cannot be negative, hence my choice for a truncated normal likelihood. However, this may not be appropriate given that some players have projections of zero points for the season - they’re usually bench players that (at least in a fully healthy season for the rest of the team) won’t see game time. It may ultimately be worth filtering the data above a minimum threshold of projected points if it enables the model to sample much better.

My overall model has the following structure. Please see the code linked below for the nuances of the variables, but they again follow a similar idea as in the rugby example.

The above graph was generated for the quarterback position. There are 105 quarterbacks in the dataset. There are 32 teams in the NFL, and each team plays 17 games. 17 * 105 = 1785, which is the size of the dataset and where the dimensionality for theta_weekly comes from. Aside - one game is played by each team per week, so “weekly” makes sense as a variable name.


  1. What tricks can I do to help the model sample faster? I imagine that my priors aren’t very informative, and my likelihood function may not even be appropriate. And that’s assuming my summation from theta_weekly to theta actually did what I was after.
  2. I tried to get creative with the summation from theta_weekly to theta. Is there a way to achieve this purely with dimensions? I think the tricky part is that not every player plays every week, so I wasn’t sure how to ultimately get things to work.
  3. Does it make more sense to move the intercept from theta_weekly to theta? I kept this in the model based on the rugby example, but since I have an extra hierarchy I’m starting to think that I should move it to theta.

Fully-interactive Google Colab to help me troubleshoot, data included: link

I think a first step, just to make things a bit easier, would be to use a Standard Scaler on your data rather than a MinMax. This will center all the projected points at zero and allow for negative values, allowing you to dispense with the Truncated Normal and just use a normal likelihood function instead.

That being said, your data seems a bit odd. There is no variation in the projected points between weeks and teams. Take, for example, Detroit’s own beloved Matthew Stafford:

      week home_team away_team              name position  proj_points   home
72       1       LAR       BUF  Matthew Stafford       QB       288.44   True
177      2       LAR       ATL  Matthew Stafford       QB       288.44   True
282      3       ARI       LAR  Matthew Stafford       QB       288.44  False
387      4        SF       LAR  Matthew Stafford       QB       288.44  False
492      5       LAR       DAL  Matthew Stafford       QB       288.44   True
590      6       LAR       CAR  Matthew Stafford       QB       288.44   True
778      8       LAR        SF  Matthew Stafford       QB       288.44   True
866      9        TB       LAR  Matthew Stafford       QB       288.44  False
958     10       LAR       ARI  Matthew Stafford       QB       288.44   True
1053    11        NO       LAR  Matthew Stafford       QB       288.44  False
1152    12        KC       LAR  Matthew Stafford       QB       288.44  False
1252    13       LAR       SEA  Matthew Stafford       QB       288.44   True
1339    14       LAR        LV  Matthew Stafford       QB       288.44   True
1437    15        GB       LAR  Matthew Stafford       QB       288.44  False
1542    16       LAR       DEN  Matthew Stafford       QB       288.44   True
1647    17       LAC       LAR  Matthew Stafford       QB       288.44  False
1752    18       SEA       LAR  Matthew Stafford       QB       288.44  False

As you can see, the projected points are a deterministic function of being Matthew Stafford. This is true for all the players in all positions: once you know the player, there’s no information to extract in the “offense” or “defense” dimension, or the time dimension for that matter. To get team-level information, you would need to consider all positions, then look at a team-position hierarchy. I don’t see an easy way to get player-level information, because there is no variation at that level. For example, if you modeled Stafford as Lions + QB + Stafford + Error, the error (or Stafford) will be zero because everything not captured by Lions and QB is precisely due to being Stats Padford (in your data). Is this perhaps a subset of a larger dataset that has variation at the player level?

I suspect this is why you take only the week 1 value for each player as the observed data, but it makes the “summing over weeks” part a bit odd. Naively, I would expect that the observed data would be all 1,785 rows. As it stands, you can (and likely should) cut out the “theta_weekly” part and just use indexes to link the 105 players directly to the teams.

Much to my surprise, however, switching to the adapt_diag_grad initializer let me run your code without any divergences. That said, the ESS was extremely small (<100), so the sampler was still struggling. Given that there is no variance at the team level given the player, nor at the offense_id (given the player), I strongly strongly expected the pair_plots to be straight lines, which they are not. What the hell do I know, eh?

Nevertheless, here is one group of plots. Sigma and player_sigma appear to be deterministically connected, but I added sigma (you had it hard coded as 2). It perhaps relates to my suspicion above, that the player error and idiosyncratic observation error would be the same thing, given information about the team. I would strongly recommend you comb through these plots and look for any peculiar relationships.

One other note, I am a bit confused by how you’ve coded “defense” and “offense”. I think “offense” means “home team”? Consider the values for Tom Brady, player id 2330:

      week  player_id position  proj_points   home  offense_id  defense_id
2968     1       2330       QB       309.21  False          27           6
2995     2       2330       QB       309.21  False          27          18
6752     3       2330       QB       309.21   True          27           9
6779     4       2330       QB       309.21   True          27          12
6806     5       2330       QB       309.21   True          27           1
3022     6       2330       QB       309.21  False          27          23
3049     7       2330       QB       309.21  False          27          29
6833     8       2330       QB       309.21   True          27          33
6860     9       2330       QB       309.21   True          27          14
6887    10       2330       QB       309.21   True          27          26
3076    12       2330       QB       309.21  False          27           5
6914    13       2330       QB       309.21   True          27          18
3103    14       2330       QB       309.21  False          27          25
6941    15       2330       QB       309.21   True          27           4
3130    16       2330       QB       309.21  False          27          22
6968    17       2330       QB       309.21   True          27          29
3157    18       2330       QB       309.21  False          27           1

Tampa Bay, Brady’s team, id 27, is listed as the offense for every game in the season, even when home = False. Should these ids not be reversed when Brady is on the road? Otherwise, what does it mean for a team to be “on offense”? Perhaps I am just being confused by terms, and what you mean to capture is a given player’s performance when facing a given team? In that case it would suffice to have a single column, “opposing_team”, since there isn’t any variation in the team Brady plays for within a single season. That is, E(Brady’s Points |Brady’s Team = TB & Facing Lions) is identically equal to E(Brady’s Points | Facing Lions), since Brady’s team is always TB.

I think the way I flipped it for Stafford’s data above would capture something like home field advantage, i.e. E(Stafford’s Points | Lions on the road & TB at home).

As for intercept, I think it should be inside the exponent as your model currently stands. Assuming I’m reading things right, moving it outside would potentially cause your theta to be negative, which isn’t good. As long as it’s inside the exp, it ought to be irrelevant where you put it, because the model for the log mean is additive.

1 Like

Thanks for your thoughts, @jessegrabowski. Very good point about the season’s projected points being deterministic per-player. Unfortunately the projections are the entire dataset, so there is not variation per week/game. That’s what I was hoping to get out of this analysis - a per-week breakdown that sums to the projected points - which should be doable once the latent variables are known.

Yeah, my offense vs. defense notation was confusing. Given the interchangeability of player/team (at least at the single position level), just having an “opponent” column would suffice. However, doesn’t this break down in a position like wide-receiver where there are multiple on the field at once? In that case, the individual player may have a different contribution than the broader team’s ability at the position. Perhaps this is why you think including all positions together make the most sense. Would love to hear your additional thoughts.

I opted to exclude players that have season projections of zero. Given that my goal is to get weekly projections, they’d obviously have projections of zero each week. It shrinks the data a bit but should only help the sampler IMO.

I also changed from my summation strategy and went back to one theta as you suggested. I’m definitely getting low effective sample sizes, but I imagine that’s because there are only 1/17 unique observed values (17 games in a season).

New Colab notebook to reflect these changes: link.
Locally I only got one divergence when I ran this, so there must be something different sampler-wise in the notebook. In any case, I only have one correlated pair of variables - intercept and player_std. My guess is that the sampler is always attributing the random noise to one of the two, but is there a way to break this relationship?

I ultimately would love to put all positions into one model, but my concerns are that ex: D/ST and kicker have very different “opponents” on the field compared to the rest of the positions (which all play against the standard defense). I would definitely have a hierarchical variable for the specific positions in this case, but again I think the stark contrast in actual “opponent” for some positions would muddle the team effects. That’s why I initially am trying to do it per-position, where none of that information needs to be accounted for. Thoughts?

I think I have a better sense of what you want to do and what your data means.

I think this is possible, but only if you introduce some time series structure into your model. Essentially, you know where you want to land and you want to infer how you might get there. Your problem is that there are an infinite number of ways to go from zero to predicted_points. Here is one idea for how to do this, by predicting points each week as a function of the player and the schedule, then considering the evolution of the player’s cumulative point total over the season as a Gaussian random walk. I think this is essentially what you did in your first model.

with pm.Model(coords=coords) as model:
    opp_mu = pm.Normal('opp_mu', mu=0.0, sigma=7)
    opp_std = pm.HalfNormal("opp_std", sigma=7)
    opp_offset = pm.Normal('opp_offset', mu=0.0, sigma=7, dims='teams')
    opp = pm.Deterministic("opp", opp_mu + opp_std * opp_offset, dims="teams")
    # Player-specific parameters
    player_mu = pm.Normal("player_mu", mu=0, sigma=7)
    player_std = pm.HalfNormal("player_std", sigma=7)    
    player_offset = pm.Normal("player_offset", mu=0, sigma=player_std, dims=["players"])
    player_mean = pm.Deterministic('player_mean', player_mu + player_std * player_offset)
    # Broadcast opp to be weeks x players
    wide_opp = opp[wide_data.opponent.apply(teams_encoder.transform).values]

    # Home advantage
    beta_home = pm.Normal('home', mu=0, sigma=7)    
    wide_home = beta_home * wide_data.home.fillna(False).values.astype(float)
    weekly_mu = player_mean + wide_opp + wide_home
    # Points scored each week is an RV centered on player traits, the opponent, and home-field advantage
    # When a team has a Bye, replace the estimated weekly score by 0.0
    weekly_performance = pm.Normal('weekly_performance', mu=weekly_mu, sigma=1,  dims=['weeks', 'players'])
    weekly_performance = at.set_subtensor(weekly_performance[bye_mask], 0.0)
    # The evolution of a player's cumulative score over the season is a gaussian random walk with drift
    points_mu = pm.Deterministic('predicted_points', 
                                 dims=['weeks', 'players'])
    points_sigma = pm.HalfNormal('points_std', sigma=7)
    # Evalute the model likelihood based on where the GRW lands the player at the end of the season
    points = pm.Normal(f"points", mu=points_mu[-1, :], sigma=points_sigma, observed=final_points, dims='players')
    trace = pm.sample(init='jitter+adapt_diag_grad', target_accept=0.9, tune=2000)

The main differences are 1. using cumsum instead of sum, so we can recover the predicted trajectory of the player over the season, and 2. working with all the data in wide format, and 3. the parameters are the non-centered format, in the hopes of solving the ESS problem. For the wide data I made a little helper dataframe:

wide_data = data.set_index(['week', 'name']).unstack(1).copy()
bye_mask =

The .unstack() also ensures that every (player, week) tuple is in the data. All the rows with missing values are the byes. The predictions for the bye weeks are masked out with zeros using bye_mask.

The problem with this approach is that, again, you are squeezing blood from a rock. There’s just not much to go on. Running this model, I got this result for Stafford (this is the posterior mean):

You can see the effect of the week-to-week changes in the team or the home-field are essentially zero. I inspected the coefficients of all the opposing teams and the home_beta, and they are all essentially centered on zero with standard deviation of 7. That is, sitting on the prior values.

In these sorts of low-data regimes, you need to bake a lot more posterior information into the model in order to get anything meaningful out. For example, you could use something like relative power rankings between the teams in order to center the effect of playing against a team something more informative than zero.

Other ways to introduce some information about the time-series evolution would be to bring in historical data about team performance and, where possible, player performance. Player stats could also be used to link together players, even when historical data is not available. You could also add some information about each stadium (domed vs open), and some weather info as the season progresses (for example, fewer points are scored in the snow/heavy rain). Even just general historical statistics about the distribution of points scored for each week as the season goes on will really help you to do better than just a linear increase.

ESS isn’t a function of the number of observations, it’s a function of the geometry of your model. You have very few unique observations it’s true, but the effect of this should be to cause the priors to largely determine the shape of the posterior. ESS becomes low when the sampler gets “stuck”, for example in a region of the likelihood space where all the mass is concentrated into a very small area, like the neck of a funnel. Or when it encounters a geometric discontinuity, like a cliff. In either case, it forces the sampler to take tiny, correlated steps (everything else is rejected (bottleneck) or diverges (cliff)), which leads to a low ESS.

Low ESS isnt’ bad per-se – people run Metropolis samplers all the time with garbage ESS. But when using NUTS you’re not supposed to see low ESS, so it can be a sign that something funky is going on.

I’m not sure that intercept is identified given the player mean. If you have \mu_{\text{player}} \sim N(\mu_p, \sigma_p^2) and \mu_i = \alpha + \mu_{\text{player}} + \epsilon_i, where \alpha is the intercept term, we can re-write this as \mu_i = \alpha + \mu_p+ \sigma_p N(0, 1) + \epsilon_i. As you can see, there is nothing distinguishing \alpha and \mu_p – they are both scalars shared by all the data points – so you can choose any infinite combination of the two. This is why I eliminated intercept in the above model. We can consider the “intercept” to be fully captured by the mean of the player ability distribution.

I think you should never throw information away, especially in the position you find yourself in. You want to infer a lot from a little, so the more information you can bring in, the better.

1 Like

Wow, again, really appreciate the detailed response. I didn’t think to consider the season projection to be a weekly random walk but that is precisely what I was trying to do! And noted about the intercept and player mean being indistinguishable. Makes a lot of sense.

I tried copying your model (Colab v3: link) but was having trouble reproducing your results. Specifically, it does not look like the bye week set_subtensor is working. My Matthew Stafford (or any player, for that matter), does not have a zero for any week. I wasn’t able to spot a typo in the actual model part, so maybe it has something to do with my wide_data setup?

It looks like the sampler is still struggling at times which may be a consequence of the broad, uninformed priors like you suggested. I actually do have access to historical points per week/stats breakdowns, teams, etc. Perhaps the easiest thing to do would be to include additional seasons with actual points scored per week (I guess injuries etc. would be zero but that isn’t worth worrying about at the moment). For the wide dataframe format with historic data, do you have any tips to mask out data where ex: the player wasn’t in the league yet? Or would this approach necessitate a non-wide format? And for including the season/year component, I guess it could be as simple as another variable like you’ve set up with wide_opp and wide_home but I imagine the more “proper” strategy would be to have GRWs for those variables indexed by season and summed at the weekly_projection level - hopefully that makes sense.

Lastly, I had to include "NaN’ as a team (for bye weeks) in the teams_encoder when using this wide_data setup. Once the set_subtensor bye week problem is fixed, do you see this causing any issues? From a likelihood perspective I don’t think it would matter since we’re manually setting to zero and summing. But I imagine this would let the sampler run free for any of the team parameters for the “NaN” team. I get the sense it’s not necessarily erroneous, but just unnecessary.

Edit: Actually I think it would make sense to take the historic weekly point values and have a second likelihood on them, but since the mu of this likelihood would be constructed from the shared priors, etc. there would be more information for the inference to go on. I’ll report back later with that code.

Okay, I implemented a model that uses historic data to learn opponent, player, etc. parameters and applies a GRW to those variables. Hopefully I did it correctly - the model samples in ~10min/chain and has 0 divergences. For reference, using the non-GRW code (commented out in the notebook), the model samples in ~2min/chain.

Oh, and I still haven’t figured out the issue with bye_mask in the v3 notebook.

Colab v4: link

My experience with GRW is mixed. They can be hard to make work, but on the other hand I’ve also had them run really fast and effectively. If the GRW approach ends up being too much hassle, you could consider something simpler. For example, a simple deterministic trend, or an AR term.

I just edited my post above - I think I got the GRW to work. :slight_smile:

1 Like

Did you check the shape of wide_data.teams is indeed (weeks, players), and that the NaN rows are in the right weeks? I pre-processed your data slightly differently because I don’t like using numeric IDs for the coords, but all the shapes matched up to yours in the end.

You could also build the bye mask yourself “by hand” by e.g. looping over the players and filling in a 1 based on the bye weeks in teams_raw.

Also, I ended up with the NaN team in my traces as well, I don’t think there’s any way around it with this approach. You might be able to come up with a clever way to do this all in long format; that is what would be necessary to avoid the NaN team i think. Maybe something using block lower triangular summation matrices? If you had a long vector of estimated points P_{i,t} (i for player, t for week), you could do something like:

\begin{bmatrix}1 & 0 & 0 &0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} P_{1,1} \\ P_{1,2} \\ P_{1,3} \\ P_{2,1} \\ P_{2,2} \end{bmatrix} = \begin{bmatrix} P_{1,1} \\ P_{1,1} + P_{1, 2} \\ P_{1,1} + P_{1,2} + P_{1,3} \\ P_{2,1} \\ P_{2,1} + P_{2,2} \end{bmatrix}

So you’d just need a function that builds these triangular matrices for each player, then use scipy.linalg.block_diag to put them together, and replace the cumsum with a matrix multiplication between your (players * weeks, 1) vector of predictions, and this (player * weeks, player * weeks) cumsum matrix. This looks more complicated but it won’t require you to have a square matrix, so you can just leave away the Bye weeks without any complications.

1 Like

I was able to implement something along the lines of your suggestion using scipy.linalg.block_diag to do a cumulative sum. Great idea!

Colab v5: link

Bye week is completely out of the picture with matrix multiplication so no issues there. Had about 20 divergences in 12000 samples once I included the future/season points projection likelihood. Overall, it seems like only the home field advantage has any real impact week-to-week (at least for the future predictions).

Here are the mean opp contributions averaged across all teams by year:

years	opp
2019	-0.00047881197545493145
2020	-0.0019807311408374948
2021	0.0033330266987389346
2022	-0.0012341329620078612

and for the home field advantage, home (still in it’s original dimensions):

years	home
2019	-0.006765458408772843
2020	-0.12553699305777513
2021	0.4575872643034748
2022	1.032494508530757

Is there anything obviously wrong with my setup, or is this most likely still a consequence of insufficient data? To add more data, I could try combining all field players together and have position be another hierarchical variable. I think I would want to exclude D/ST and kicker since those two positions see a different “defense” - D/ST against the offense, kickers against a special teams unit. What do you think?