I am building a sports analytics model that uses the regular-season results as the prior for some of the playoff parameters. I am using the model by Baio and Blangiardo, following one of the PyMC3 examples.

My attempt follows the KDE linear-interpolation example; however, my model is multi-level, and I am unsure how to perform the KDE linear interpolation on a random variable that has multiple indices.
Here is a snippet for more context:
```python
with pm.Model() as regular_season_model:
    # global model parameters
    home = pm.Flat('home')
    sd_att = pm.Exponential('sd_att', lam=10)
    sd_def = pm.Exponential('sd_def', lam=10)
    intercept = pm.Flat('intercept')

    # team-specific model parameters
    atts_star = pm.Normal("atts_star", mu=0, sigma=sd_att, shape=num_teams)
    defs_star = pm.Normal("defs_star", mu=0, sigma=sd_def, shape=num_teams)
    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star))
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))
    home_theta = tt.exp(intercept + home + atts[df_s.home_team_id] + defs[df_s.away_team_id])
    away_theta = tt.exp(intercept + atts[df_s.away_team_id] + defs[df_s.home_team_id])

    # likelihood of observed data
    home_points = pm.Poisson('home_points', mu=home_theta, observed=df_s.home_goals)
    away_points = pm.Poisson('away_points', mu=away_theta, observed=df_s.away_goals)

    regular_season_trace = pm.sample(2000, tune=2000, cores=3)
```
```python
def from_posterior(param, samples, k=100):
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, k)
    y = stats.gaussian_kde(samples)(x)
    # what was never sampled should have a small probability but not 0,
    # so we extend the domain and use a linear approximation of the density on it
    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return pm.Interpolated(param, x, y)
```
```python
with pm.Model() as playoff_model:
    # global model parameters
    home = pm.Flat('home')
    sd_att = from_posterior('sd_att', regular_season_trace['sd_att'])
    sd_def = from_posterior('sd_def', regular_season_trace['sd_def'])
    intercept = from_posterior('intercept', regular_season_trace['intercept'])

    # team-specific model parameters
    # not sure how to code these priors???
    atts_star = from_posterior('atts_star', regular_season_trace['atts_star'])
    defs_star = from_posterior('defs_star', regular_season_trace['defs_star'])
    # old way:
    # atts_star = pm.Normal("atts_star", mu=0, sigma=sd_att, shape=num_teams)
    # defs_star = pm.Normal("defs_star", mu=0, sigma=sd_def, shape=num_teams)
    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star))
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))
    home_theta = tt.exp(intercept + home + atts[df_p.home_team_id] + defs[df_p.away_team_id])
    away_theta = tt.exp(intercept + atts[df_p.away_team_id] + defs[df_p.home_team_id])

    # likelihood of observed data
    home_points = pm.Poisson('home_points', mu=home_theta, observed=df_p.home_goals)
    away_points = pm.Poisson('away_points', mu=away_theta, observed=df_p.away_goals)
```
What I want is to use "informed priors" for the team strengths from the regular season, since the playoff sample is too small to refit team strengths from scratch (a team that loses in the first round contributes only a handful of games, and teams that go further contribute more, but still not many). Some variables I do want to fit anew: for example `home`, since one of the things I want to see is whether home advantage changes in the playoffs, and the total number of playoff games is a reasonably sized sample for a single global parameter like that.
My initial thought is to create a separate variable for each team's `atts_star` and `defs_star` and call `from_posterior()` on each one individually, but I am unsure whether this brute-force approach would even work (would it lose the "nested"/"grouped" quality of the multi-level partial pooling?), and I am wondering if there is a better way that keeps them as one variable with an index per team. This also makes the `from_posterior()` function awkward, because I need a different KDE for each index of the random variable, and I don't think there is a vectorized 1-D KDE. A multi-dimensional KDE would instead fit a single multivariate distribution, which is not the desired behaviour here. Needless to say, I am a bit lost and confused.
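For concreteness, here is a minimal sketch of that brute-force idea, assuming `regular_season_trace['atts_star']` is a `(draws, num_teams)` array: run an independent 1-D KDE on each column and keep the per-team `(x, y)` grids, each of which would then be passed to its own `pm.Interpolated` inside the playoff model. The fake trace, shapes, and variable names below are illustrative, not my actual data.

```python
import numpy as np
from scipy import stats

def kde_grid(samples, k=100):
    """1-D KDE grid for one team's posterior samples, with padded tails
    so regions that were never sampled keep a small but nonzero density."""
    smin, smax = samples.min(), samples.max()
    width = smax - smin
    x = np.linspace(smin, smax, k)
    y = stats.gaussian_kde(samples)(x)
    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return x, y

# fake stand-in for regular_season_trace['atts_star']: 2000 draws, 4 teams
rng = np.random.default_rng(0)
atts_samples = rng.normal(loc=[0.1, -0.2, 0.3, -0.1], scale=0.05, size=(2000, 4))

# one independent 1-D KDE per team (per column), NOT one multivariate KDE
grids = [kde_grid(atts_samples[:, i]) for i in range(atts_samples.shape[1])]

# inside the playoff model, each pair would then become its own named prior, e.g.
#     atts_star_i = pm.Interpolated(f'atts_star_{i}', x, y)
# and the per-team variables stacked back together (e.g. with tt.stack)
# to recover the atts[df_p.home_team_id]-style indexing
```

This at least keeps each team's marginal posterior as its own 1-D prior, but it is exactly the approach I worry loses the partial-pooling structure.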
Any help/advice/insight would be greatly appreciated. Thanks.