Zero Inflated Poisson model with aggregated data


I want to model the amount of goals scored by a player in a football (American soccer) match using zero inflated Poisson model. The problem is that I can not just model the probability of playing in a specific match because player may be playing different amount of minutes each match. So I was thinking about modelling p as the probability of playing in a given minute.

Here’s how my data looks like

| minutes_played | goals |
| 56             |   1   |
| 90             |   2   |
| 0              |   0   |

So here is how I try to model it:

with pm.Model() as zip_model:

a_mu = pm.Normal('au', 0, 1)
a_sigma = pm.Exponential('a_sigma', 1)
ap = pm.Normal('ap', a_mu, a_sigma)
p = pm.math.invlogit(ap)

y = pm.Binomial('y', p = p, n = 90, observed = data.minutes_played) #Probability of playing in a given minute

al = pm.Normal('al', 0, 1)
lambda_ = pm.math.exp(al)

goals = pm.ZeroInflatedPoisson('goals', psi = pm.math.invlogit(y), theta = lambda_, observed = data.goals)

trace = pm.sample(draws = 4000, tune = 1000)

Problem is I get a warning for effective samples lower than 200 for some parameters (how bad is that?) and that for after tuning the were ~450 divergences…

Am I missing something with the model?
because the simulation results seems a bit off.

sim_goals = np.random.binomial(n = 1,p = logistic(trace['ap'])).reshape(-1,1)*np.random.poisson(np.exp(trace['al']))



I don’t think that is the best way to define the generating nature of your data. My proposal would be to make use of the Poisson parameter lam since it is by definition a rate.

Let’s suppose that you are modeling two players (from what you posted as an example it looked like you just have one, but I’m assuming that the goal is to model more players).

# define parameters
prob_not_scoring = [0.3, 0.2] # games with no goals
rate_goals = [1.0, 2.]  # average goals a game

# sample games
N = 100
num_players = 2

# simulate games no goals
no_goals = np.random.binomial(1, prob_not_scoring, size=(N,num_players))

# simulate goals
y = (1 - no_goals) * np.random.poisson(rate_goals, size=(N, num_players))


We generated our data. Now, let’s make use of the Poisson rate by defining the time played and using it as the denominator (you’ll see this as a subtraction because I’m using the log, but it is the denominator in the natural scale).

y_f = y.T.flatten().reshape(-1,1).T[0]
players = np.repeat((np.arange(0,len(y.T))), N).reshape(-1,1).T[0]
time_played = np.round(stats.uniform.rvs(0, 90, size=len(y_f)),0)

log_time_played = np.log(time_played)

# fit the model:
with pm.Model() as m:
    a = pm.Normal("a", 0.0, 1.0)
    b = pm.Normal("b", 0.0, 1.0)
    lam = pm.math.exp(log_time_played + a + b*players)
    psi = pm.Beta('psi', 1, 1)
    y_obs = pm.ZeroInflatedPoisson("y_obs", psi, lam, observed=y_f)

    trace = pm.sample(2000, tune=2000)

Now, you can have a look at the parameters for both players:

lambda_player1 = np.exp(trace["a"])*90
lambda_player2 = np.exp(trace["a"] + trace["b"])*90

    {"lambda_p1": lambda_player1, "lambda_p2": lambda_player2}, kind="stats", round_to=2

Hope it helps!

1 Like

Hi luisroque,

Thanks a lot for your help.
I couldn’t understand what players exactly is or what it should represent ?
If I understand correctly, then psi represents the probability of a player playing in a match, if so, I think we’re modelling probabilities of independent events together (as it is probability of two different players) ?

at the end , I think I’m still missing something because even with your model, the estimated values for lambda_1, lambda_2 is 2.56, 4.23 which is far off from the actual values (1, 2)


The array players represents each player that we are analyzing, it is 0 for the first player and 1 for the second player.

No, psi is the probability of not scoring in a match (it is only important for the zero-inflated part of the model).

They are not off, remember that when we defined time_played as a uniform dist from 0 to 90, the average number of goals that each player scored per 90 minutes is much higher than 1 or 2 (because they played much less than 90 minutes for each of the 100 games!).

1 Like