Negative Binomial Multi-level Regression

I have been playing with a model similar to the rugby example where the log of the expectation is modelled as a linear combination of factors (home advantage as well as team offense and defense strength in this case). The thing is I have data that is typically overdispersed (the variance is greater than the mean, often around 2-5x the mean) and my understanding is that a Negative Binomial distribution would better represent the data than Poisson.

I have found that I get better fits to the data (as measured by WAIC and loo) when using the Negative Binomial distribution, but I feel that the way I am defining alpha is quite arbitrary and I don’t know how to interpret the parameter after fitting. So currently I set up the model like in the rugby example except I have simply changed the likelihood from Poisson to Negative Binomial, and then added a prior for an alpha variable. At first I simply used very broad priors such as HalfNormal(sigma=50) and Uniform(0, 100), which seemed to converge fine and produced a fit better than Poisson. Then I tried setting the likelihood to NegativeBinomial(mu=mu, alpha=alpha*mu), so that now alpha is a fraction/multiple of the mean. This results in noticeably better fits (according to WAIC and loo) but its starting to feel very ad-hoc and I don’t really know any good reason/theory/explanation/interpretation for this.

So I am wondering if anyone has any experience fitting multilevel models to overdispersed count data with Negative Binomial likelihoods, and how they went about specifying and then interpreting the alpha parameter. Any relevant books/papers/articles I should read up on?

Here is some code to help make the model I described above a little more clear:

with pm.Model() as model:
    sd_att = pm.HalfNormal('sd_att', sigma=1)
    sd_def = pm.HalfNormal('sd_def', sigma=1)
    home = pm.Normal('home', mu=0, sigma=1)
    intercept = pm.Flat('intercept')
    alpha = pm.Uniform('alpha', 0, 100)

    atts_star = pm.Normal("atts_star", mu=0, sigma=sd_att, shape=num_teams)
    defs_star = pm.Normal("defs_star", mu=0, sigma=sd_def, shape=num_teams)
    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star))
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))
    home_mu = tt.exp(intercept + home + atts[df.home_team_id] + defs[df.away_team_id])
    away_mu = tt.exp(intercept + atts[df.away_team_id] + defs[df.home_team_id])

    home_points = pm.NegativeBinomial('home_points', mu=home_mu, alpha=home_mu*alpha, observed=df.home_team_pts)
    away_points = pm.NegativeBinomial('away_points', mu=away_mu, alpha=away_mu*alpha, observed=df.away_team_pts)

    trace = pm.sample(2000, tune=1000, cores=4)