Random Walk Model - Terrible sampling but realistic inference?

I built the following model, inspired by a Mitzi Morris PyData talk:

with pm.Model() as m7:
    packed_L_α = pm.LKJCholeskyCov('packed_L_α', n=len(team_dct),
                                   eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L_α = pm.expand_packed_triangular(len(team_dct), packed_L_α)

    nu = pm.Uniform('nu', lower=1, upper=100)
    α = pm.MvStudentTRandomWalk('alpha', nu=nu,
                                shape=(len(np.unique(t)), len(team_dct)), chol=L_α)
    beta = pm.Normal('beta', 0, 1)

    sigma_y = pm.Exponential('sigma_y', 1)  # pm.Normal('sigma_y', 0, 2.5)

    ability = pm.Deterministic('ability', α + beta * rank_std.values)
    y = pm.Normal('y', ability[t, team1_] - ability[t, team2_], sigma_y, observed=result)
    trace7 = pm.sample(1000, tune=1000)

The model predicts match scores in Serie A (Italian football) with a latent variable, ability, which is a team's tendency to outscore another team. My personal goal is to infer each team's ability and how it changed over time.

The model had 3 divergences and the traceplots look terrible, but a lot of the inference makes sense - for instance, AC Milan's tendency to outscore other teams had a huge leap that matched their change in performance.

Milan isn't the only one that makes sense - the inferred abilities for a lot of these teams look credible and match how they've been performing.

My question is - how do I take this mis-specified model, which might be on to something, and specify it properly? Example notebook attached (change the extension to .ipynb).

Example Model.py (1.1 MB)

Not sure how helpful this is, but I have had issues in the past with trying to infer the degrees of freedom of a Student-t as part of a model. I suspect it can be weakly identified. My suggestion would be to fix nu to a value that you think is reasonable and see if that helps things!
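To see why nu can be weakly identified: once nu is moderately large, the Student-t density is nearly indistinguishable from a Gaussian, so the likelihood is almost flat across a huge range of nu. A quick check with scipy (the two df values are just illustrative):

import numpy as np
from scipy import stats

x = np.linspace(-5, 5, 1001)

# Student-t densities for two very different degrees of freedom
pdf_30 = stats.t.pdf(x, df=30)
pdf_100 = stats.t.pdf(x, df=100)

# the curves are almost identical, so the data can barely tell them apart
print(np.max(np.abs(pdf_30 - pdf_100)))  # on the order of 1e-3

With the uniform(1, 100) prior above, most of that range is effectively indistinguishable, which is why fixing nu (or giving it a much more informative prior) often helps sampling.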

Have you tried using a MvGaussianRandomWalk for the latent ability and StudentT for the observation?

The underlying ability of a team probably doesn't exhibit jump-like behaviour, but each individual match result can be far from the expected one.

You also need a bias for the home team; I'd guess the mean of the result vector is around 0.3.
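As a toy sketch of that generative story (all scales here are made up, not fitted values): ability drifts smoothly as a Gaussian random walk with small innovations, while observed margins scatter around home bias plus the ability difference with heavy-tailed Student-t noise.

import numpy as np

rng = np.random.default_rng(0)
n_rounds, n_teams = 38, 20

# latent ability: smooth Gaussian random walk with small innovations
ability = np.cumsum(rng.normal(0.0, 0.1, size=(n_rounds, n_teams)), axis=0)

# observed margin of team 0 (home) over team 1: heavy-tailed noise around
# home bias + ability difference, so single results can be far from expected
home_bias = 0.3
margin = home_bias + (ability[:, 0] - ability[:, 1]) + rng.standard_t(df=4, size=n_rounds)

The Student-t observation noise absorbs occasional blowout results without forcing the latent ability to jump.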

Really good suggestion, makes a lot of sense, thank you! I ended up getting a better fit by changing to a GRW, as well as switching from a multivariate to a no-pooling Gaussian random walk and adding an AR1 term. Here's the model:

with pm.Model() as m2c:

    # latent variables
    team_bias = pm.GaussianRandomWalk('team_bias', sigma=1,
                                      shape=(len(np.unique(t)), len(team_dct)))

    mu_form = pm.Normal('mu_form', 0, 0.5)
    team_form = pm.AR1('team_form', k=mu_form, tau_e=0.1,
                       shape=(len(np.unique(t)), len(team_dct)))

    # home team advantage
    mu_home = pm.Normal('mu_home', 0.3, 0.4)
    home = pm.Normal('home', mu_home, 0.5, shape=len(team_dct))

    # prior
    sigma_y = pm.HalfNormal('sigma_y', 0.1)  # match variance
    ability = pm.Deterministic('ability', team_bias + team_form)

    mu_match = home[team1_] + ability[t, team1_] - ability[t, team2_]
    y = pm.Normal('y', mu=mu_match, sd=sigma_y, observed=result)
    trace2c = pm.sample(1000, tune=1000, return_inferencedata=True)

Two issues I’m running into:

  1. I can’t figure out why having both a RW and an AR1 process leads to the best model. Is that weird theoretically?
  2. Is there a clever way to add a prior that limits extreme match results and puts more weight on predictions around 0? An example of why my current model is problematic is below.

I don’t think it makes much sense that the probability of Milan outscoring Parma by 4+ goals is the same as the probability of them drawing or losing.
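That equality is baked into any symmetric predictive. With hypothetical numbers, a Normal(2, 2) predictive for the margin puts exactly the same mass on "4+ goals" as on "draw or lose", because both events sit one standard deviation from the mean:

from scipy import stats

# hypothetical predictive distribution for Milan's margin over Parma
pred = stats.norm(loc=2.0, scale=2.0)

p_blowout = pred.sf(4.0)        # P(margin >= 4): one sd above the mean
p_draw_or_loss = pred.cdf(0.0)  # P(margin <= 0): one sd below the mean

print(p_blowout, p_draw_or_loss)  # both ~0.159 -- identical by symmetry

No prior on the latent parameters can remove this; it is a property of the symmetric Gaussian likelihood on the margin itself.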

Indeed, having two time series is a bit strange; I think you are overfitting to the data. I only quickly read the code, but it seems like the ‘t’ data is essentially one step per match for each team. The AR1 process has a high standard deviation of innovations, so it will essentially fit each match result and thus have no predictive power. You could either reduce the sigma or have the t variable apply to a few matches at a time.
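One way to make t apply to a few matches at a time (a sketch, assuming t is an integer array of per-match time indices starting at 0): integer-divide it so consecutive matches share a latent time step.

import numpy as np

# hypothetical per-match time indices, one step per match
t = np.arange(12)

# coarsen so every block of 4 matches shares one latent time step
block = 4
t_coarse = t // block
print(t_coarse)  # [0 0 0 0 1 1 1 1 2 2 2 2]

The random walk's first shape dimension would then shrink to len(np.unique(t_coarse)), so each latent step is informed by several results instead of one.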

As an aside, you might be interested in pd.factorize(); it could be used instead of your custom codify() function.
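For reference, pd.factorize maps labels to integer codes plus an index of the unique values (the team names here are just examples):

import pandas as pd

teams = pd.Series(['Milan', 'Parma', 'Milan', 'Juventus'])
codes, uniques = pd.factorize(teams)

print(codes)    # [0 1 0 2] -- integer code per row, in order of first appearance
print(uniques)  # Index(['Milan', 'Parma', 'Juventus'], dtype='object')

The codes array can be used directly as the team1_/team2_ index arrays, and uniques gives the reverse lookup.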