Improve Sampling of a Multilevel Regression Model

Hello dear PyMCommunity!

Context

As an intellectual challenge, I’m trying to model elections in Paris at the district level (there are 20 “arrondissements”) – this is an excuse to use predictors at both the district level and the city level, just as in the radon notebook, where the uranium predictor is used at the county level.

I put the data and model in this gist, but here is the model:

# tt is theano.tensor; the data (city_unemp, polls, etc.) is defined in the gist
with pm.Model() as m_h_intercept_polls:
    mus_parties = []
    for p_id, p in enumerate(PARTIES[:-1]):
        city_a = pm.Normal(f"city_a_{p}", mu=-1.8, sigma=1.0)
        city_b = pm.Normal(f"city_b_{p}", mu=0.0, sigma=0.1, shape=2)
        σ_a = pm.HalfNormal(f"σ_a_{p}", 1.0)

        a = city_a + city_b[0] * city_unemp + city_b[1] * polls[:, p_id]
        za_district = pm.Normal(
            f"za_district_{p}", 0.0, 1.0, shape=(Ndistricts, Ndates)
        )
        a_district = pm.Deterministic(f"a_district_{p}", a + za_district * σ_a)
        b = pm.Normal(f"b_{p}", 0.0, 0.1, shape=2)

        # expected value per district:
        mu = a_district[district_id, date_id] + pm.math.dot(district_preds, b)

        mus_parties.append(mu)
    mus_parties = tt.as_tensor_variable(mus_parties).T

    # append last category:
    vary_pivot = tt.as_tensor_variable(
        np.full(shape=(Ndistricts, Ndates, 1), fill_value=-2.2)
    )
    mus_parties = tt.horizontal_stack(mus_parties, vary_pivot[district_id, date_id])

    # preferences of each district, for all elections:
    latent_p = pm.Deterministic("latent_p", tt.nnet.softmax(mus_parties))

    # zero-inflation process:
    # keep only preferences for available parties:
    slot_prob = parties_available * latent_p
    # normalize preferences:
    slot_prob = pm.Deterministic(
        "slot_prob", slot_prob / tt.sum(slot_prob, axis=1, keepdims=True)
    )
    R = pm.Multinomial("R", n=N, p=slot_prob, observed=R_obs)

The process is complicated, so I had to make some modeling choices that are too long to detail here, but I welcome any questions.

Problem

This model samples without divergences, but sampling is very long and NUTS diagnostics aren’t good:

The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

The absence of divergences is puzzling to me, as there is clearly a problem with the city-level parameters – the rest is in the gist, but here is a sample:

Questions

  1. I tried different priors, used a non-centered parameterization, and increased target_accept and the number of tuning steps – nothing seemed to work. Does it just mean this is a case where these parameters can’t be estimated with these data? Or is there something wrong with my implementation?

  2. The uranium example only has one observation per household. Here, I have 12 observations (i.e. elections) per district, so I had to include this dimension in the model – did I do it wrong?
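
For context, the non-centered parameterization mentioned in question 1 (which the model above already uses via za_district) amounts to sampling a standardized offset and rescaling it; a minimal numpy sketch of the idea, with made-up values for the mean and scale:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = -1.8, 0.5  # hypothetical group-level mean and scale

# Centered: sample the group-level effect directly.
a_centered = rng.normal(mu, sigma, size=1000)

# Non-centered: sample a standard-normal offset, then shift and scale.
# Both imply the same distribution for a, but the non-centered version
# decorrelates the offset from sigma in the posterior geometry, which
# usually helps NUTS in hierarchical models.
z = rng.normal(0.0, 1.0, size=1000)
a_noncentered = mu + z * sigma
```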

I’m pretty sure I’m doing something wrong but I can’t see where, as I’ve never seen a case where the sampler doesn’t converge but there are no divergences :thinking:

I hope it’s clear and I’m really grateful for any help! :champagne:
PyMCheers from Paris :vulcan_salute:


Seems your model is unidentifiable; one of the things that jumps out at me is the “double” normalization: you do a softmax and then later rv/rv.sum(). Maybe try something like the below instead:

    # preferences of each district, for all elections:
    latent_p = parties_available * mus_parties
    # zero-inflation process:
    # keep only preferences for available parties:
    slot_prob = pm.Deterministic("slot_prob", tt.nnet.softmax(latent_p))

Also, in the latent space the standard deviation is not very meaningful – try setting all the standard deviations to 1.

One last nitpick: you might want to remove the for loop and use a vectorized version instead (though since the number of PARTIES is not large, you might not see much improvement).
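
The identifiability issue can be seen with plain numpy: softmax is already invariant to shifting all logits by a constant, so stacking a second normalization on top only adds more ways for different latent values to map to the same final probabilities. A toy illustration (not the model itself):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0])

# Softmax maps shifted logits to the same probabilities,
# which is why a pivot (reference) category is needed...
p1 = softmax(logits)
p2 = softmax(logits + 10.0)
assert np.allclose(p1, p2)

# ...and renormalizing the softmax output is a no-op, so every
# extra normalization layer leaves the latent scale unidentified.
renorm = p1 / p1.sum()
assert np.allclose(renorm, p1)
```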


Thank you very much Junpeng – gonna try all of that tomorrow!
Just a small clarification: what do you mean by the standard deviation not being very meaningful in the latent space?
Also, if I set all the stds to 1, I lose the group-level std, right? But then don’t I lose the benefits of pooling of information, shrinkage and all the hierarchical goodies?

Ok… I’m back @junpenglao :champagne: :stuck_out_tongue_winking_eye:
I took some time to try and understand what’s going on.

About your first point: I think I have to do the double normalization, otherwise the zeros in latent_p will turn into non-zero probabilities after the softmax (since exp(0) = 1) – whereas I want a probability of zero for these non-available parties.
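
A quick numpy check of that point – masking before the softmax leaves the zeroed entries with exp(0) = 1 in the numerator, whereas masking after the softmax and renormalizing gives a true zero:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

latent = np.array([0.5, 0.3, 0.2])     # hypothetical latent preferences
available = np.array([1.0, 0.0, 1.0])  # party 2 not running

# Masking *before* the softmax: the zeroed entry still gets
# exp(0) = 1 in the numerator, hence a non-zero probability.
p_before = softmax(available * latent)
assert p_before[1] > 0

# Masking *after* the softmax and renormalizing gives a true zero:
p_after = available * softmax(latent)
p_after = p_after / p_after.sum()
assert p_after[1] == 0
```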

After several trials, I noticed that abandoning the three-level structure and implementing varying-effects for the remaining two levels improved sampling considerably. Here is just the model, but I also updated the gist:

with pm.Model() as m_var_effects:
    mus_parties = []
    log_unemployment = pm.Data(
        "log_unemployment", stdz(np.log(d["unemployment"].values))
    )
    for p_id, p in enumerate(PARTIES_AGG):
        stdz_polls = pm.Data(f"{p}", stdz(d[p]).values)
        # prior stddev for itcs and slopes (i.e variation among districts):
        sd_dist = pm.Exponential.dist(3.0)
        packed_chol = pm.LKJCholeskyCov(
            f"chol_cov_p{p_id}", eta=2, n=3, sd_dist=sd_dist
        )

        # compute covariance matrix:
        chol = pm.expand_packed_triangular(n=3, packed=packed_chol, lower=True)
        cov = pm.math.dot(chol, chol.T)

        # extract rho and standard deviations:
        sigma_ab = pm.Deterministic(f"sigma_pop_p{p_id}", tt.sqrt(tt.diag(cov)))
        corr = tt.diag(sigma_ab ** -1).dot(cov.dot(tt.diag(sigma_ab ** -1)))
        r = pm.Deterministic(f"Rho_p{p_id}", corr[np.triu_indices(n=3, k=1)])

        # average itc and slopes for each party:
        if p in ["left_agg", "center_agg", "right_agg"]:
            ab = pm.Normal(
                f"ab_p{p_id}",
                mu=np.array([-1.4, 0.0, 0.1]),
                sd=np.array([0.5, 0.05, 0.05]),
                shape=3,
            )
        elif p in ["farleft_agg", "green_agg"]:
            ab = pm.Normal(
                f"ab_p{p_id}",
                mu=np.array([-1.9, 0.0, 0.1]),
                sd=np.array([0.5, 0.05, 0.05]),
                shape=3,
            )
        elif p == "farright_agg":
            ab = pm.Normal(
                f"ab_p{p_id}",
                mu=np.array([-2.7, 0.0, 0.0]),
                sd=np.array([0.5, 0.05, 0.05]),
                shape=3,
            )

        # population of varying effects:
        ab_district = pm.MvNormal(
            f"ab_district_p{p_id}", mu=ab, chol=chol, shape=(Ndistricts, 3)
        )
        mus_parties.append(
            ab_district[district_id, 0]
            + ab_district[district_id, 1] * log_unemployment
            + ab_district[district_id, 2] * stdz_polls
        )
    mus_parties = tt.as_tensor_variable(mus_parties).T
    # append last category:
    vary_pivot = pm.Deterministic(
        "vary_pivot",
        tt.as_tensor_variable(np.full(shape=(Ndistricts, 1), fill_value=-3.0)),
    )
    mus_parties = tt.horizontal_stack(mus_parties, vary_pivot[district_id])

    # preferences of each district, for all elections:
    p_latent = pm.Deterministic("p_latent", tt.nnet.softmax(mus_parties))
    # zero-inflation process:
    # keep only preferences for available parties:
    p_district = parties_available * p_latent
    # renormalize preferences:
    p_district = pm.Deterministic(
        "p_district", p_district / tt.sum(p_district, axis=1, keepdims=True)
    )

    R = pm.Multinomial("R", n=N, p=p_district, observed=R_obs)

However, posterior predictive checks are quite bad. Here is just a sample:


It seems to me that the model has a hard time capturing the variations in the data (notice how the predictions are wrong, and yet very confident). I wonder if it’s because the model is “time ignorant” – i.e. the latent probabilities it estimates are assumed to be constant over time.

So, would a model that estimates latent probabilities that change over time do a better job? If yes, what would I need there – a hidden Markov model? I’m really curious about your take on that!
A big thank you in advance :vulcan_salute:

My comment only applies to models where you have a softmax or other normalization – since the latent parameters across groups get normalized, their scale is not identified.

The trace looks pretty good indeed – not sure how to explain the PPC, but I still think it is normalization-related.


Thanks Junpeng – helpful, as always!

But this would mean you can’t implement a hierarchical structure for GLMs, wouldn’t it? I’m quite puzzled by that implication.

I can actually test this hypothesis: the double normalization is there to take care of non-available categories. So, if I restrict the data to cases where all categories are available, then I only need one normalization, and I’ll see if the PPCs improve – I’ll keep you posted on this experiment! :man_scientist:

You still can – in my experience the hierarchical structure gets pushed to the slopes instead of the noise (the variance parameters of the latent hyperparameters). FWIW, I think you should not set all the sigmas to 1, but definitely the ones at the highest level of the graphical model.


Ok that’s clearer – and interesting. I was quite sad to have to give up all the hierarchical goodies :cry:

BTW, I found a way to get rid of the double normalization :tada: Sampling is somewhat better (it was already good) and the PPCs look a lot better!
Doing some more checks and improvements and will keep you posted :wink:
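
For reference, one common way to avoid the double normalization (not necessarily what was done here) is to mask the logits with -inf before the softmax: since exp(-inf) = 0, a single softmax handles both the masking and the normalization. A small numpy sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5])     # hypothetical latent scores
available = np.array([1.0, 0.0, 1.0])  # party 2 not running

# Send unavailable parties' logits to -inf: exp(-inf) = 0,
# so one softmax produces an exact zero AND a proper normalization.
masked_logits = np.where(available > 0, logits, -np.inf)
p = softmax(masked_logits)

assert p[1] == 0.0
assert np.isclose(p.sum(), 1.0)
```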

Good to hear! Next step: autocorrelated temporal parameter :wink:


Ooh indeed – would it serve to make the model “time-aware”? And what would the difference be with a hidden Markov model, for instance?

No difference – you can express an autoregressive model as a state-space model, i.e. the continuous-state-space version of a hidden Markov model.
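
To make that equivalence concrete: an AR(1) process is already in state-space form – a Markov transition over a continuous latent state plus an emission (observation) distribution, exactly the transition/emission structure of an HMM but with a continuous state. A toy numpy sketch (all parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
phi, sigma_state, sigma_obs = 0.8, 0.1, 0.05
T = 200

# Transition (the hidden state): x_t = phi * x_{t-1} + noise,
# the continuous analogue of an HMM's transition matrix.
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal(0.0, sigma_state)

# Emission (what we observe): y_t = x_t + noise,
# the analogue of an HMM's emission distribution.
y = x + rng.normal(0.0, sigma_obs, size=T)
```

In the election model, x would play the role of a latent preference drifting over time, with the multinomial likelihood as the emission step.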


Sounds really useful – I’d really like to implement that in my next model! Do you have any resources to recommend?