How would I model individual goals scored as a parameter of total team goals scored in a PyMC3 regression model?

Hi how’s it going?

I’m trying to model individual goals scored, and I’ve realized that my model sometimes returns predictions that wouldn’t be possible in the real world. For example, because Barcelona has so many great players, adding up the individual predictions for their team can imply that they will score 6-7 goals in a single game.

Here’s a sample dataset… and bare bones regression model for some context.

import pandas as pd

df = pd.DataFrame({
    'team': ['Real Madrid', 'Real Madrid', 'Real Madrid', 'Barcelona', 'Barcelona', 'Barcelona'],
    'team_goals': [3, 3, 3, 4, 4, 4],
    'player': ['Karim Benzema', 'Luka Modric', 'Sergio Ramos', 'Lionel Messi', 'Luis Suárez', 'Antoine Griezmann'],
    'average_player_goal_per_game': [.58, .09, .27, .67, .34, .36],
    'player_goals': [2, 1, 0, 1, 2, 1],
})

import pymc3 as pm

x = df['average_player_goal_per_game'].values
y = df['player_goals'].values

with pm.Model() as goals_model:
    # Priors for the intercept and the slope
    a = pm.Normal("a", 0, 1)
    bA = pm.Normal("bA", 0, 1)

    # Prior for the observation noise
    sigma = pm.Uniform("sigma", 0, 1)

    # Expected goals for each player
    mu = pm.Deterministic("mu", a + bA * x)
    goals = pm.Normal("goals", mu=mu, sigma=sigma, observed=y)
    trace_goals = pm.sample()

My goal is to add some sort of constraint so that the summed goal predictions for any given team have a ceiling defined by a distribution over the ‘team_goals’ column.

If anyone can help me or point me in the right direction, I’d greatly appreciate it.

My first thought is to model team goals, and then use a multinomial model to ‘divvy up’ the team’s goals to the players. Is there a reason you want to go in the other direction, so to speak, modeling the players and using that to estimate the teams?
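To make the "divvy up" idea concrete, here is a minimal NumPy simulation of the two-level generative story, not a fitted PyMC3 model. The Poisson rate, the "other" bucket for the rest of the squad, and the use of per-game averages as multinomial weights are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Per-game scoring averages from the sample dataset, plus an assumed
# "other" bucket for the rest of the squad (0.60 is a made-up value).
players = ["Lionel Messi", "Luis Suárez", "Antoine Griezmann", "other"]
avg_goals = np.array([0.67, 0.34, 0.36, 0.60])

# Level 1: team goals. The rate of 2.0 is a placeholder, not an estimate.
team_rate = 2.0
n_sims = 1000
team_goals = rng.poisson(team_rate, size=n_sims)

# Level 2: split each simulated team total across the players.
shares = avg_goals / avg_goals.sum()
player_goals = np.array([rng.multinomial(n, shares) for n in team_goals])

# By construction, player goals always add up to the team total.
assert (player_goals.sum(axis=1) == team_goals).all()

# Average simulated goals per player.
print(dict(zip(players, player_goals.mean(axis=0).round(2))))
```

In PyMC3 the same structure would roughly correspond to a Poisson likelihood on team goals and a `pm.Multinomial` likelihood on player goals, with the shares given a `pm.Dirichlet` prior rather than fixed as they are here.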

One reason the sum of the expected player-level goals is biased upward is that teams take their foot off the gas when they are leading (see my_blog_post). You could test this hypothesis by looking at the average goals scored by (e.g.) Ramos conditional on how many goals Benzema has.
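A rough way to run that check with pandas, assuming you have match-level data with one row per player per match. The `match_id` column and the numbers below are made up purely for illustration; they are not real results.

```python
import pandas as pd

# Hypothetical match-level data (invented values, not real results).
matches = pd.DataFrame({
    "match_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "player": ["Karim Benzema", "Sergio Ramos"] * 4,
    "player_goals": [0, 1, 2, 0, 1, 0, 0, 0],
})

# One row per match, one column per player.
wide = matches.pivot(index="match_id", columns="player", values="player_goals")

# Average Ramos goals for each value of Benzema's goal count.
ramos_given_benzema = wide.groupby("Karim Benzema")["Sergio Ramos"].mean()
print(ramos_given_benzema)
```

If the conditional averages fall as Benzema's count rises, that would be consistent with the hypothesis, though with real data you would want many more matches per group before reading much into it.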

No reason at all to go the other way! Thanks for the advice. I’m a bit self-taught and new to PyMC3 and Bayesian inference. Do you know of any good articles or resources that walk through building a hierarchical multinomial model that actually divvies up the first-level predictions into second-level predictions?

I used a hierarchical multinomial model to predict this year’s elections in Paris at the district level. It’s not a sports model, of course, but maybe you’ll find something interesting in it and it’ll help you understand the concepts :man_shrugging:

Ok cool thank you so much!
