Improve Sampling of a Multilevel Regression Model

Hello dear PyMCommunity!

Context

As an intellectual challenge, I’m trying to model elections in Paris at the district level (there are 20 “arrondissements”) – this is an excuse to use predictors at both the district level and the city level, just as in the radon notebook, where the uranium predictor is used at the county level.

I put the data and model in this gist, but here is the model:

# tt is theano.tensor; the data (city_unemp, polls, etc.) is defined in the gist
with pm.Model() as m_h_intercept_polls:
    mus_parties = []
    for p_id, p in enumerate(PARTIES[:-1]):
        city_a = pm.Normal(f"city_a_{p}", mu=-1.8, sigma=1.0)
        city_b = pm.Normal(f"city_b_{p}", mu=0.0, sigma=0.1, shape=2)
        σ_a = pm.HalfNormal(f"σ_a_{p}", 1.0)

        a = city_a + city_b[0] * city_unemp + city_b[1] * polls[:, p_id]
        za_district = pm.Normal(
            f"za_district_{p}", 0.0, 1.0, shape=(Ndistricts, Ndates)
        )
        a_district = pm.Deterministic(f"a_district_{p}", a + za_district * σ_a)
        b = pm.Normal(f"b_{p}", 0.0, 0.1, shape=2)

        # expected value per district:
        mu = a_district[district_id, date_id] + pm.math.dot(district_preds, b)

        mus_parties.append(mu)
    mus_parties = tt.as_tensor_variable(mus_parties).T

    # append last category:
    vary_pivot = tt.as_tensor_variable(
        np.full(shape=(Ndistricts, Ndates, 1), fill_value=-2.2)
    )
    mus_parties = tt.horizontal_stack(mus_parties, vary_pivot[district_id, date_id])

    # preferences of each district, for all elections:
    latent_p = pm.Deterministic("latent_p", tt.nnet.softmax(mus_parties))

    # zero-inflation process:
    # keep only preferences for available parties:
    slot_prob = parties_available * latent_p
    # normalize preferences:
    slot_prob = pm.Deterministic(
        "slot_prob", slot_prob / tt.sum(slot_prob, axis=1, keepdims=True)
    )
    R = pm.Multinomial("R", n=N, p=slot_prob, observed=R_obs)

The process is complicated, so I had to make some modeling choices that are too long to detail here, but I welcome any questions.

Problem

This model samples without divergences, but sampling is very long and NUTS diagnostics aren’t good:

The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

The absence of divergences is puzzling to me, as there is clearly a problem with the city-level parameters – the rest is in the gist, but here is a sample:

Questions

  1. I tried different priors, used a non-centered parameterization, and increased target_accept and the number of tuning steps – nothing seemed to work. Does it just mean this is a case where these parameters can’t be estimated with these data? Or is there something wrong with my implementation?

  2. The uranium example only has one observation per household. Here, I have 12 observations (i.e. elections) per district, so I had to include this dimension in the model – did I do it wrong?
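
For context, the non-centered parameterization mentioned in question 1 (which the model above already uses via za_district) amounts to sampling a standardized offset and rescaling it; a minimal numpy sketch of the idea, with made-up values for the mean and scale:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = -1.8, 0.5  # hypothetical group-level mean and scale

# Centered: sample the group-level effect directly.
a_centered = rng.normal(mu, sigma, size=1000)

# Non-centered: sample a standard-normal offset, then shift and scale.
# Both imply the same distribution for a, but the non-centered version
# decorrelates the offset from sigma in the posterior geometry, which
# usually helps NUTS in hierarchical models.
z = rng.normal(0.0, 1.0, size=1000)
a_noncentered = mu + z * sigma
```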

I’m pretty sure I’m doing something wrong but I can’t see where, as I’ve never seen a case where the sampler doesn’t converge but there are no divergences :thinking:

I hope it’s clear and I’m really grateful for any help! :champagne:
PyMCheers from Paris :vulcan_salute:


Seems your model is unidentifiable; one of the things that jumps out at me is the “double” normalization: you do a softmax and then later rv/rv.sum(). Maybe try something like the below instead:

    # preferences of each district, for all elections:
    latent_p = parties_available * mus_parties
    # zero-inflation process:
    # keep only preferences for available parties:
    slot_prob = pm.Deterministic("slot_prob", tt.nnet.softmax(latent_p))

Also, in the latent space the standard deviation is not very meaningful – try setting all the standard deviations to 1.

One last nitpick: you might want to remove the for loop and use a vectorized version instead (though since the number of PARTIES is not large, you might not see much improvement).
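
The identifiability issue can be seen with plain numpy: softmax is already invariant to shifting all logits by a constant, so stacking a second normalization on top only adds more ways for different latent values to map to the same final probabilities. A toy illustration (not the model itself):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0])

# Softmax maps shifted logits to the same probabilities,
# which is why a pivot (reference) category is needed...
p1 = softmax(logits)
p2 = softmax(logits + 10.0)
assert np.allclose(p1, p2)

# ...and renormalizing the softmax output is a no-op, so every
# extra normalization layer leaves the latent scale unidentified.
renorm = p1 / p1.sum()
assert np.allclose(renorm, p1)
```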


Thank you very much Junpeng – gonna try all of that tomorrow!
Just a small clarification: what do you mean by the standard deviation not being very meaningful in the latent space?
Also, if I set all the stds to 1, I lose the group-level std, right? But then don’t I lose the benefits of pooling of information, shrinkage and all the hierarchical goodies?

Ok… I’m back @junpenglao :champagne: :stuck_out_tongue_winking_eye:
I took some time to try and understand what’s going on.

About your first point: I think I have to do the double normalization, otherwise the zeros in latent_p will turn into non-zero probabilities after the softmax (since exp(0) = 1) – whereas I want a probability of zero for these non-available parties.
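
A quick numpy check of that point – masking before the softmax leaves the zeroed entries with exp(0) = 1 in the numerator, whereas masking after the softmax and renormalizing gives a true zero:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

latent = np.array([0.5, 0.3, 0.2])     # hypothetical latent preferences
available = np.array([1.0, 0.0, 1.0])  # party 2 not running

# Masking *before* the softmax: the zeroed entry still gets
# exp(0) = 1 in the numerator, hence a non-zero probability.
p_before = softmax(available * latent)
assert p_before[1] > 0

# Masking *after* the softmax and renormalizing gives a true zero:
p_after = available * softmax(latent)
p_after = p_after / p_after.sum()
assert p_after[1] == 0
```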

After several trials, I noticed that abandoning the three-level structure and implementing varying-effects for the remaining two levels improved sampling considerably. Here is just the model, but I also updated the gist:

with pm.Model() as m_var_effects:
    mus_parties = []
    log_unemployment = pm.Data(
        "log_unemployment", stdz(np.log(d["unemployment"].values))
    )
    for p_id, p in enumerate(PARTIES_AGG):
        stdz_polls = pm.Data(f"{p}", stdz(d[p]).values)
        # prior stddev for itcs and slopes (i.e variation among districts):
        sd_dist = pm.Exponential.dist(3.0)
        packed_chol = pm.LKJCholeskyCov(
            f"chol_cov_p{p_id}", eta=2, n=3, sd_dist=sd_dist
        )

        # compute covariance matrix:
        chol = pm.expand_packed_triangular(n=3, packed=packed_chol, lower=True)
        cov = pm.math.dot(chol, chol.T)

        # extract rho and standard deviations:
        sigma_ab = pm.Deterministic(f"sigma_pop_p{p_id}", tt.sqrt(tt.diag(cov)))
        corr = tt.diag(sigma_ab ** -1).dot(cov.dot(tt.diag(sigma_ab ** -1)))
        r = pm.Deterministic(f"Rho_p{p_id}", corr[np.triu_indices(n=3, k=1)])

        # average itc and slopes for each party:
        if p in ["left_agg", "center_agg", "right_agg"]:
            ab = pm.Normal(
                f"ab_p{p_id}",
                mu=np.array([-1.4, 0.0, 0.1]),
                sd=np.array([0.5, 0.05, 0.05]),
                shape=3,
            )
        elif p in ["farleft_agg", "green_agg"]:
            ab = pm.Normal(
                f"ab_p{p_id}",
                mu=np.array([-1.9, 0.0, 0.1]),
                sd=np.array([0.5, 0.05, 0.05]),
                shape=3,
            )
        elif p == "farright_agg":
            ab = pm.Normal(
                f"ab_p{p_id}",
                mu=np.array([-2.7, 0.0, 0.0]),
                sd=np.array([0.5, 0.05, 0.05]),
                shape=3,
            )

        # population of varying effects:
        ab_district = pm.MvNormal(
            f"ab_district_p{p_id}", mu=ab, chol=chol, shape=(Ndistricts, 3)
        )
        mus_parties.append(
            ab_district[district_id, 0]
            + ab_district[district_id, 1] * log_unemployment
            + ab_district[district_id, 2] * stdz_polls
        )
    mus_parties = tt.as_tensor_variable(mus_parties).T
    # append last category:
    vary_pivot = pm.Deterministic(
        "vary_pivot",
        tt.as_tensor_variable(np.full(shape=(Ndistricts, 1), fill_value=-3.0)),
    )
    mus_parties = tt.horizontal_stack(mus_parties, vary_pivot[district_id])

    # preferences of each district, for all elections:
    p_latent = pm.Deterministic("p_latent", tt.nnet.softmax(mus_parties))
    # zero-inflation process:
    # keep only preferences for available parties:
    p_district = parties_available * p_latent
    # renormalize preferences:
    p_district = pm.Deterministic(
        "p_district", p_district / tt.sum(p_district, axis=1, keepdims=True)
    )

    R = pm.Multinomial("R", n=N, p=p_district, observed=R_obs)

However, posterior predictive checks are quite bad. Here is just a sample:


It seems to me that the model has a hard time capturing the variations in the data (notice how the predictions are wrong, and yet very confident). I wonder if it’s because the model is “time ignorant” – i.e. the latent probabilities it estimates are assumed to be constant over time.

So, would a model that estimates latent probabilities that change over time do a better job? If yes, what would I need there – a hidden Markov model? I’m really curious about your take on that!
A big thank you in advance :vulcan_salute:

My comment only applies to models where you have a softmax or other normalization – since the latent parameters across groups get normalized, their scale is not identified.

The trace looks pretty good indeed – not sure how to explain the PPC, but I still think it is normalization-related.


Thanks Junpeng – helpful, as always!

But this would mean you can’t implement a hierarchical structure for GLMs, wouldn’t it? I’m quite puzzled by that implication.

I can actually test this hypothesis: the double normalization is there to take care of non-available categories. So, if I restrict the data to cases where all categories are available, then I only need one normalization, and I’ll see if the PPCs improve – I’ll keep you posted on this experiment! :man_scientist:

You still can – in my experience the hierarchical structure gets pushed to the slopes instead of the noise (the variance parameters of the latent hyperparameters). FWIW, I think you should not set all the sigmas to 1, but definitely the ones at the highest level of the graphical model.


Ok that’s clearer – and interesting. I was quite sad to have to give up all the hierarchical goodies :cry:

BTW, I found a way to get rid of the double normalization :tada: Sampling is somewhat better (it was already good) and the PPCs look a lot better!
Doing some more checks and improvements and will keep you posted :wink:
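
For reference, one common way to avoid the double normalization (not necessarily what was done here) is to mask the logits with -inf before the softmax: since exp(-inf) = 0, a single softmax handles both the masking and the normalization. A small numpy sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5])     # hypothetical latent scores
available = np.array([1.0, 0.0, 1.0])  # party 2 not running

# Send unavailable parties' logits to -inf: exp(-inf) = 0,
# so one softmax produces an exact zero AND a proper normalization.
masked_logits = np.where(available > 0, logits, -np.inf)
p = softmax(masked_logits)

assert p[1] == 0.0
assert np.isclose(p.sum(), 1.0)
```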

Good to hear! Next step: autocorrelated temporal parameter :wink:


Ooh indeed – would it serve to make the model “time-aware”? And what would the difference be with a hidden Markov model, for instance?

No difference – you can express an autoregressive model as a state-space model, i.e. the continuous-state-space version of a hidden Markov model.
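
To make that equivalence concrete: an AR(1) process is already in state-space form – a Markov transition over a continuous latent state plus an emission (observation) distribution, exactly the transition/emission structure of an HMM but with a continuous state. A toy numpy sketch (all parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
phi, sigma_state, sigma_obs = 0.8, 0.1, 0.05
T = 200

# Transition (the hidden state): x_t = phi * x_{t-1} + noise,
# the continuous analogue of an HMM's transition matrix.
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal(0.0, sigma_state)

# Emission (what we observe): y_t = x_t + noise,
# the analogue of an HMM's emission distribution.
y = x + rng.normal(0.0, sigma_obs, size=T)
```

In the election model, x would play the role of a latent preference drifting over time, with the multinomial likelihood as the emission step.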


Sounds really useful – I’d really like to implement that in my next model! Do you have any resources to recommend?