Random Walk Model - Terrible sampling but realistic inference?

I built the following model, inspired by a Mitzi Morris PyData talk:

import numpy as np
import pymc3 as pm

# team_dct, t, team1_, team2_, rank_std and result come from the data-prep cells
with pm.Model() as m7:
    # Cholesky factor of the between-team covariance
    packed_L_α = pm.LKJCholeskyCov('packed_L_α', n=len(team_dct),
                                   eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L_α = pm.expand_packed_triangular(len(team_dct), packed_L_α)

    nu = pm.Uniform('nu', lower=1, upper=100)
    α = pm.MvStudentTRandomWalk('alpha', nu=nu,
                                shape=(len(np.unique(t)), len(team_dct)),
                                chol=L_α)
    beta = pm.Normal('beta', 0, 1)

    sigma_y = pm.Exponential('sigma_y', 1)  # alternative: pm.Normal('sigma_y', 0, 2.5)

    ability = pm.Deterministic('ability', α + beta * rank_std.values)
    y = pm.Normal('y', ability[t, team1_] - ability[t, team2_], sigma_y,
                  observed=result)
    trace7 = pm.sample(1000, tune=1000)

The model predicts match scores in Serie A (Italian football) via a latent variable, ability, which captures a team's tendency to outscore other teams. My goal is to infer each team's ability and how it changed over time.

The model had 3 divergences and the traceplots look terrible, but a lot of the inference makes sense. For instance, AC Milan's tendency to outscore other teams shows a huge leap that matches their change in performance.

Milan isn't the only one that makes sense; the inference for a lot of these teams looks credible and matches how they've been performing.

My question is: how do I take this mis-specified model, which might be on to something, and specify it properly? Example notebook attached (change the extension to .ipynb).

Example Model.py (1.1 MB)


Not sure how helpful this is, but I have had issues in the past trying to infer the degrees of freedom of a Student-t as part of a model. I suspect it can be weakly identified. My suggestion would be to fix nu to a value you think is reasonable and see if that helps!
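To see why nu can be weakly identified, here's a quick scipy check (my own illustration, not part of the model above): once nu is moderately large, the Student-t density is nearly indistinguishable from a Normal, so the likelihood is almost flat in nu and the data barely constrain it.

```python
import numpy as np
from scipy import stats

# Compare Student-t and Normal densities over a grid: as nu grows,
# the maximum pointwise gap shrinks towards zero, so observations
# carry very little information about nu beyond "it's large-ish".
x = np.linspace(-4, 4, 201)
gaps = {}
for nu in (5, 30, 100):
    gaps[nu] = float(np.max(np.abs(stats.t.pdf(x, df=nu) - stats.norm.pdf(x))))
    print(f"nu={nu:>3}: max |t pdf - normal pdf| = {gaps[nu]:.4f}")
```

The gap falls off quickly with nu, which is one reason a Uniform(1, 100) prior on nu can leave the sampler wandering over a near-flat direction.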


Have you tried using a MvGaussianRandomWalk for the latent ability and a StudentT for the observation?

The underlying ability of a team probably doesn't exhibit jump-like behaviour, but each individual match result can be far from what's expected.

You also need a bias for the home team; I'd guess the mean of the result vector is around 0.3.
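A quick numpy sketch of that point (my own toy simulation, not PyMC code; all numbers are made up for illustration): a Gaussian random walk makes small, smooth moves in the latent ability, while heavy-tailed Student-t observation noise still allows the occasional extreme match result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks = 38  # one latent state per match week, as a stand-in for t

# Smooth latent ability: a Gaussian random walk with small step size.
ability = np.cumsum(rng.normal(0.0, 0.1, size=n_weeks))

# Heavy-tailed observation noise: single results can be far off.
match_noise = rng.standard_t(df=4, size=n_weeks)
observed = ability + match_noise

print("max |ability step|:   ", np.max(np.abs(np.diff(ability))))
print("max |match residual|: ", np.max(np.abs(match_noise)))
```

The latent path barely moves week to week, yet individual observations can swing by several units; putting the heavy tails on the observation rather than the walk keeps the inferred ability smooth.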


Really good suggestion, makes a lot of sense, thank you! I ended up getting a better fit by changing to a GRW: switching from a multivariate random walk to a no-pooling Gaussian random walk, and adding an AR1 term. Here's the model:

with pm.Model() as m2c:

    # latent variables
    team_bias = pm.GaussianRandomWalk('team_bias', sigma=1,
                                      shape=(len(np.unique(t)), len(team_dct)))

    mu_form = pm.Normal('mu_form', 0, 0.5)
    team_form = pm.AR1('team_form', k=mu_form, tau_e=0.1,
                       shape=(len(np.unique(t)), len(team_dct)))

    # home team advantage
    mu_home = pm.Normal('mu_home', 0.3, 0.4)
    home = pm.Normal('home', mu_home, 0.5, shape=len(team_dct))

    # priors
    sigma_y = pm.HalfNormal('sigma_y', 0.1)  # match-level sd
    ability = pm.Deterministic('ability', team_bias + team_form)

    mu_match = home[team1_] + ability[t, team1_] - ability[t, team2_]
    y = pm.Normal('y', mu=mu_match, sd=sigma_y, observed=result)
    trace2c = pm.sample(1000, tune=1000, return_inferencedata=True)

Two issues I’m running into:

  1. I can't figure out why having both a RW and an AR1 process leads to the best model. Is that weird theoretically?
  2. Is there a clever way to add a prior that limits extreme match results and puts more weight on predictions around 0? An example of why my current model is problematic is below.

I don't think it makes much sense that the probability of Milan outscoring Parma by 4+ goals is the same as the probability of them drawing or losing.

Indeed, having two time series is a bit strange; I think you are overfitting to the data. I only quickly read the code, but it seems like the 't' data is essentially per match for each team. The AR1() process has a high standard deviation of innovations, so it will essentially fit to each match result and thus have no predictive power. You could either reduce the sigma or have the t variable apply to a few matches at a time.
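To put a rough number on that (my reading of the pm.AR1 signature, where tau_e is the *precision* of the innovations, not their sd):

```python
# If tau_e in pm.AR1 is the innovation precision, then tau_e = 0.1
# corresponds to an innovation standard deviation of 1/sqrt(0.1):
tau_e = 0.1
innovation_sd = tau_e ** -0.5
print(f"innovation sd ≈ {innovation_sd:.2f}")  # ≈ 3.16
```

An innovation sd around 3 per step is huge relative to typical goal differences, so the form term is free to absorb almost any single result.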

As an aside, you might be interested in pd.factorize(), which could be used instead of your custom codify() function.
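For example (team names here are just placeholders), pd.factorize returns integer codes plus the array of unique labels in one call:

```python
import pandas as pd

# Map each team name to a stable integer code; `uniques` keeps the
# reverse mapping from code back to name.
teams = pd.Series(["Milan", "Parma", "Inter", "Milan", "Parma"])
codes, uniques = pd.factorize(teams)
print(codes)          # [0 1 2 0 1]
print(list(uniques))  # ['Milan', 'Parma', 'Inter']
```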