Prior for Hierarchical Negative Binomial Regression

Hey everyone,
I could use a bit of help from the community on a model convergence issue.

I’m working with a simple Negative Binomial (NB) regression. Everything runs smoothly with the pooled model, but when I switch to a hierarchical NB setup, convergence becomes unstable and the R-hat values are consistently above 1.00.

I’ve done prior predictive checks and things seem okay, but my gut* tells me something might be off with how I’ve specified the hierarchical priors.

Does anyone see any red flags in my prior setup that could be messing with convergence?

Here’s a simplified model diagram and the relevant code. Happy to provide a full minimal working example with data if needed!

import numpy as np
import pymc as pm
import pytensor.tensor as pt

with pm.Model(coords=coords) as model:
    # Set the data
    subdivision_idx = pm.Data("subdivision_idx", data["subdivision_idx"].values, dims="survey")
    obs = pm.Data("obs", data["n_items"].values, dims="survey")

    # Priors
    nu = pm.Normal("nu", np.log(150), 1.5)  # <- I suppose the error here...
    kappa = pm.HalfNormal("kappa", 0.2)  # <- ... or here
    log_mu = pm.Normal("log_mu", nu, kappa, dims="group")
    mu = pm.Deterministic("mu", pt.exp(log_mu), dims="group")

    # Likelihood
    observed = pm.NegativeBinomial(
        "litter",
        mu=mu[subdivision_idx],
        alpha=1,
        observed=obs,
        dims="survey",
    )

Looking at the ESS, kappa seems to be the problem here…

Thanks in advance! Appreciate any ideas or insights.


(*) For context, I ran into a similar issue recently when using a hierarchical prior for the dispersion parameter alpha, and switching to modeling 1/alpha instead helped a lot.

Hi, @nrieger!

Have you tried standardizing/scaling your data? It should improve convergence during sampling, and afterwards you can always recover your estimates on the original scale.

Thanks for the suggestion, @Dekermanjian! I did consider that, but I’m not sure it makes sense in my case. Since I’m working with count data, scaling it directly would break one of its key properties, namely that it’s discrete.

That said, because Negative Binomial regression uses a log link, the data is effectively scaled in the model. In my case, log(mu) falls roughly between 3 and 7, and I’ve tried to reflect that in how I set up my priors.
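To make that concrete, here is a simplified numpy sketch of the kind of prior predictive check I mean (not my exact code; the seed and sample size are arbitrary, and the prior values are the ones from the model above):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Priors from the model above: nu ~ Normal(log(150), 1.5), kappa ~ HalfNormal(0.2)
nu = rng.normal(np.log(150), 1.5, size=n)
kappa = np.abs(rng.normal(0.0, 0.2, size=n))
log_mu = rng.normal(nu, kappa)

# The central 95% interval of the implied log(mu) should roughly cover [3, 7]
lo, hi = np.quantile(log_mu, [0.025, 0.975])
print(f"95% prior interval for log(mu): [{lo:.2f}, {hi:.2f}]")
```

The interval comes out a little wider than [3, 7], which I took to mean the priors are weakly informative rather than too tight.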

1 Like

Okay, I understand. I also noticed that you are setting alpha in the NegativeBinomial likelihood to a static value of 1. Is there a particular reason for that? I would try putting a prior on alpha, something like pm.Exponential("alpha", lam=1/10).

1 Like

You can try a non-centered parameterization for your hierarchy; this is often nicer to sample. A zero-avoiding prior on kappa, like Gamma(2, 1), can also help, although you are then making a strong assumption: that the different subdivisions must be at least somewhat different. Estimating sigma = 0 on the hierarchical parameter is a valid result; it rejects the presence of a hierarchy (i.e. all units are the same). In practice, I find that a very small but non-zero sigma is good enough.
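For reference, the non-centered trick is just a shift-and-scale of standard-normal draws. A minimal numpy sketch of the equivalence (the nu/kappa values are borrowed from your priors above and the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical fixed values standing in for one draw of the hyperparameters
nu, kappa = np.log(150), 0.2

# Centered: draw log_mu ~ Normal(nu, kappa) directly
centered = rng.normal(nu, kappa, size=n)

# Non-centered: draw a standard-normal offset z, then shift and scale it.
# The distribution of log_mu is identical, but the sampler now explores z
# independently of kappa, which avoids the funnel geometry that hurts NUTS
# when kappa gets small.
z = rng.normal(0.0, 1.0, size=n)
noncentered = nu + kappa * z

print(centered.mean(), noncentered.mean())
```

In PyMC terms, you would replace `log_mu = pm.Normal("log_mu", nu, kappa, dims="group")` with `z = pm.Normal("z", 0, 1, dims="group")` followed by `log_mu = pm.Deterministic("log_mu", nu + kappa * z, dims="group")`.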

2 Likes

Thanks @Dekermanjian and @jessegrabowski! The non-centered parametrization was exactly what I needed. :partying_face: I also tried using a Gamma prior, but it didn’t help, and since one of my goals is to test whether the hierarchy is even necessary, it wasn’t ideal anyway.

@Dekermanjian: I fixed alpha just to keep the example simple :slight_smile: for my actual model, it’s inferred too.

1 Like

The prior on the negative binomial overdispersion parameter can be a bit dicey. Take a look at this. The penalized complexity (PC) prior has worked well for me. There’s an implementation in this gist.

4 Likes

Thanks @bwengals! I actually came across that Stan page on prior recommendations earlier; it helped me fix my initial issue with alpha (I ended up putting the prior on 1/sqrt(alpha), which samples much more robustly now). Your implementation of the PC prior looks really interesting too! I’ll definitely give it a try!
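In case it helps anyone else landing here, the 1/sqrt(alpha) reparameterization looks roughly like this (a sketch, not my exact code; the HalfNormal(1) scale is just an example choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Put the prior on phi = 1 / sqrt(alpha) instead of on alpha itself
phi = np.abs(rng.normal(0.0, 1.0, size=100_000))  # phi ~ HalfNormal(1)
alpha = 1.0 / phi**2                              # implied prior on alpha

# phi near 0 maps to very large alpha, i.e. near-Poisson behaviour, so this
# prior gently shrinks the model toward the simpler Poisson limit
print("median implied alpha:", np.median(alpha))
```

In the model this becomes something like `phi = pm.HalfNormal("phi", 1)` followed by `alpha = pm.Deterministic("alpha", 1 / phi**2)`, with `alpha` passed to the NegativeBinomial likelihood.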

hey @bwengals I just went through the Simpson paper on the PC prior and your notebook, really fantastic stuff! I think I’ve got the main idea: you’re using KL divergence to measure how far an over-dispersed model (with dispersion parameter \alpha) deviates from a base Poisson model. That makes sense to me.

Where I’m getting a bit stuck is in how you set the two free parameters that define the PC prior. In your notebook, you write:

\alpha \sim 1 / \text{Weibull}(x, a=0.5, b=1/\lambda^2)
I follow that part, but I’m not sure how you determine \lambda. You define it as:

-\log(\epsilon) / d_{KL}(L^{-1})

Could you help me interpret what \epsilon and L represent in this context? From your inset plot, it seems like L might set the mode of \alpha, and \epsilon controls how much prior mass falls below that mode—but that’s just my guess.

How do you typically choose \epsilon and L in practice, based on your data or modeling goals? Any tips or references would be super helpful!

OK, I think I’ve got it, and my initial guess was correct. Rereading the original paper, I see that the parameters L and \epsilon correspond to U and \alpha in the Simpson paper. Then, according to Equation (3.2),
P(\alpha > L) = \epsilon \quad \implies \quad \exp(-\lambda L) = \epsilon \quad \implies \quad \lambda = -\frac{\ln(\epsilon)}{L}.
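So, concretely (eps = 0.05 and L = 10 here are placeholder values I made up for illustration, not the ones from the notebook):

```python
import numpy as np

# Require the tail condition P(alpha > L) = eps, then solve for lambda
eps, L = 0.05, 10.0
lam = -np.log(eps) / L

# Sanity check: the exponential tail reproduces the target probability
print(lam, np.exp(-lam * L))
```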

Dan Simpson et al.’s paper is great. Here’s the final version (we link to a draft in the Stan prior recommendations).

The basic idea is that you have a simple model and a complex model; you scale them both to constant variance, interpolate between them, and then rescale for the actual variance. The hard part for me is working out the math to get the scaling of the base distributions right.

I hadn’t realized someone had worked out the negative binomial vs. Poisson case. I always just penalize the overdispersion parameter, but the negative binomial is hard to fit because there are two ways to explain high values: a higher mean or more dispersion. That leads to a lot of correlation in the posteriors, even with a strong shrinkage prior.

1 Like