Divergences in varying intercepts hierarchical model

I have a small dataset that’s relatively simple, but fitting a partial pooling model is running into a lot of divergences, even when trying to include non-centering (pymc3 version = 3.8).

I really want to leverage partial pooling since I have more features to include (but I wanted to start simple with just 3 features since I had divergences).

def hierarchical_normal(name, shape, μ=0., sig=1., centered=False):

if not centered:
    Δ = pm.Normal('Δ_{}'.format(name), 0., 1., shape=shape)
    σ = pm.HalfCauchy('σ_{}'.format(name), sig)
    return pm.Deterministic(name, μ + Δ * σ)

    mu = pm.Normal('μ_{}'.format(name), μ, 1)
    σ = pm.HalfCauchy('σ_{}'.format(name), sig)
    return pm.Normal('α_{}'.format(name), mu, σ, shape=shape)

with pm.Model() as model1:
    β0 = pm.Normal('β0', 0., 1., testval=df.yes.sum()/df['N'].sum())
    α_age = hierarchical_normal('age', n_age, sig=3, centered=False)
    α_income = hierarchical_normal('income', n_income, sig=3, centered=False)
    α_gender = hierarchical_normal('gender', n_gender, sig=3, centered=False)

    η = β0 \
        + α_age[age_]\
        + α_income[income_]\
        + α_gender[gender_]

    p = pm.Deterministic('p', pm.math.invlogit(η))
    obs = pm.Binomial('obs', df.N.values, p, observed=df.yes.values)

with model1:
    trace = pm.sample(

I’m surprised to see it fitting so poorly since its such a simple dataset.

trace plots below

And full code attached as a .py file

test.py (5.8 KB)

I attempted to dig a little further and attempt to specify some correlation between income-age and income-gender ( :frowning: ) to hopefully help the model fit a little better and it turned out to be a much worse fit.

In some sense, I get it, because I now have 235 params for a 60 entry dataset - but on the other hand, Its aggregated data so its not really 60 entries its more like 205,000 observations - shouldn’t this not be a big deal?

with pm.Model() as model3:

# fixed priors
g = pm.Normal("g", mu=0.0, sd=1.0, shape=n_income)
sd_dist = pm.Exponential.dist(1.0)
chol_age, _, _ = pm.LKJCholeskyCov(
    "chol_age", n=n_income, eta=4, sd_dist=sd_dist, compute_corr=True
chol_gender, _, _ = pm.LKJCholeskyCov(
    "chol_gender", n=n_income, eta=4, sd_dist=sd_dist, compute_corr=True

# adaptive priors centered
#     α_age = pm.MvNormal("α_age", mu=0.0, chol=chol_income, shape=(n_age, n_income))
#     α_gender = pm.MvNormal("α_gender", mu=0.0, chol=chol_gender, shape=(n_gender, n_income))

# adaptive priors non-centered
z_age = pm.Normal("z_age", 0.0, 1.0, shape=(n_income, n_age))
α_age = pm.Deterministic("α_age", pm.math.dot(chol_age, z_age) )
z_gender = pm.Normal("z_gender", 0.0, 1.0, shape=(n_income, n_gender))
α_gender = pm.Deterministic("α_gender", pm.math.dot(chol_gender, z_gender) )

# linear equation
η = g[income_] + α_age[income_, age_] + α_gender[income_, gender_]
p = pm.Deterministic('p', pm.math.invlogit(η) )

# Fit
obs = pm.Binomial('obs', df.N.values, p, observed=df.conversion.values)

trace3 = pm.sample(

Hopefully someone with more experience/understanding of Bayesian modeling will step-in, but I bet the problem is that your covariates are highly correlated. In particular, I would think age and income are very strongly positively correlated. Does it help if you try to remove one or add more informative priors? Perhaps, try plotting scatter plots of the posterior samples from the MCMC - that should indicate quite clearly if there is high correlation between your parameters.

Just wondering why you have such a large burn in relative to draws, why you use advi rather than default, and why you increased target so much? (I assume the last is because of divergences?)

It Turns out dropping the gender income interaction and instead encoding gender as a fixed effect, along with 10000 samples and target_accept=0.95 lead me to no divergences! The model still fits a little slower than I’d like but the posterior predictive checks look somewhat reasonable. I’m still curious if there’s a better model specification however

@bayes The large burn in is to make sure the parameters space is properly explored first, and I’ve noticed advi initialization helps avoid chain failures and bad initialization when I try models with only varying intercepts