# Multilevel model specification

I’m trying to use a hierarchical model to analyze my data, but I’m running into a lot of divergences.
I’m basing the model specification on a blog post by Thomas Wiecki (https://twiecki.io/blog/2017/02/08/bayesian-hierchical-non-centered/). Also see a full working example of my use case below.

I don’t really understand the reason for sigma_a and sigma_b in this specification. The blog post says they can indicate confidence in the offsets from the mean, but I think that’s also captured by the distribution on the offsets themselves. Further, it seems to me that they will correlate with a_offset and b_offset.
For example, sigma_a could increase a lot as long as a_offset also shrinks towards 0. Looking at the traces, the sigmas seem to show the same funnel tendency as described in the blog post.
If I remove the sigmas from the model specification, the problem with divergences disappears, so this seems to be at least part of the problem. What are sigma_a and sigma_b useful for, and what am I giving up by removing them?

EDIT FOR CLARIFICATION:
Note that the model is parametrized as:
`b = mu_b + b_offset * sigma_b`
And analogous for `a`. In this case it seems like `b_offset` and `sigma_b` are tightly coupled. Say that the offset should be 1; that can be modelled either as `b_offset = 1, sigma_b = 1`, or as `b_offset = 1000, sigma_b = 0.001`. This looks like an identifiability problem, and I don’t see what in the model makes `sigma_b` higher, or closer to 1, as we gain more confidence.
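To make the coupling concrete, here is a tiny numpy illustration (my own, not from the blog post) showing that very different `(b_offset, sigma_b)` pairs produce the same effective slope `b`:

```python
import numpy as np

mu_b = 0.0

# Two very different (b_offset, sigma_b) pairs...
b_first = mu_b + 1.0 * 1.0        # b_offset = 1,    sigma_b = 1
b_second = mu_b + 1000.0 * 0.001  # b_offset = 1000, sigma_b = 0.001

# ...give the same effective slope b = mu_b + b_offset * sigma_b:
print(np.isclose(b_first, b_second))
```

The priors (standard normal on `b_offset`, HalfCauchy on `sigma_b`) are what break this tie in the posterior, so the pair is not strictly unidentified, but the ridge-like geometry along `b_offset * sigma_b = const` still seems hard for the sampler.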

Full code example:

``````
import pandas as pd
import pymc3 as pm

data = pd.DataFrame({
    'log_sales': [1.94591015, 2.39789527, 1.94591015, 2.39789527, 1.79175947, 2.48490665,
                  1.94591015, 2.30258509, 1.94591015, 2.77258872, 1.60943791, 1.94591015,
                  1.38629436, 1.38629436, 0.69314718, 1.09861229],
    'log_price': [2.30258509, 2.30258509, 2.30258509, 2.30258509, 2.39789527, 2.39789527,
                  2.39789527, 2.39789527, 1.79175947, 1.79175947, 1.79175947, 1.79175947,
                  2.07944154, 2.07944154, 2.07944154, 2.07944154],
    'product_id': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
})
n_products = 2

with pm.Model() as hierarchical_model_non_centered:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Transformed:
    a_offset = pm.Normal('a_offset', mu=0, sd=1, shape=n_products)
    a = pm.Deterministic("a", mu_a + a_offset * sigma_a)

    b_offset = pm.Normal('b_offset', mu=0, sd=1, shape=n_products)
    b = pm.Deterministic("b", mu_b + b_offset * sigma_b)

    # Model error
    eps = pm.HalfCauchy('eps', 5)

    sales_est = a[data.product_id.values] + b[data.product_id.values] * data.log_price.values

    # Data likelihood
    sales_like = pm.Normal('sales_like', mu=sales_est, sd=eps, observed=data.log_sales)

    hierarchical_non_centered_trace = pm.sample(chains=2, draws=2000, tune=1000, target_accept=0.9)
``````

Hey, micke. I have thought about this before, i.e. what `sigma` and `mu` separately mean within a single prior distribution in a larger model, and the answer I have come to is that while the latter tries to get you the right answer, the former accounts for the uncertainty that is “inherent” to all estimation.

Even OLS, the simplest of models, always attaches a confidence to the estimates it gives back. So do you think your model here will fundamentally work correctly if you try to remove either sigma_a or sigma_b?

What does your code look like when you remove them? I would also try playing around with both the prior and sampler settings.

Thanks for your answer. If I’m understanding your explanation correctly, it sounds like sigma acts as some kind of catch-all for uncertainty in the estimation? I’ll need to think about how that actually manifests in the sampling here, because I still don’t see how the uncertainty can be modelled like that: what effect makes the sigma lower if we are more uncertain, and wouldn’t mu just increase as a reaction then?

Whether my model works correctly when removing the sigmas is what I’m trying to figure out and I’m not sure exactly how to go about that, hence my question about their relevance.

I need to remove both of the sigmas to avoid the divergences and also have to increase target_accept to 0.9. This is what the code looks like then. I’ve just commented out the parts that are removed to make it easy to compare:

``````
import pandas as pd
import pymc3 as pm

data = pd.DataFrame({
    'log_sales': [1.94591015, 2.39789527, 1.94591015, 2.39789527, 1.79175947, 2.48490665,
                  1.94591015, 2.30258509, 1.94591015, 2.77258872, 1.60943791, 1.94591015,
                  1.38629436, 1.38629436, 0.69314718, 1.09861229],
    'log_price': [2.30258509, 2.30258509, 2.30258509, 2.30258509, 2.39789527, 2.39789527,
                  2.39789527, 2.39789527, 1.79175947, 1.79175947, 1.79175947, 1.79175947,
                  2.07944154, 2.07944154, 2.07944154, 2.07944154],
    'product_id': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
})
n_products = 2

with pm.Model() as hierarchical_model_non_centered:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    # sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    # sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Transformed:
    a_offset = pm.Normal('a_offset', mu=0, sd=1, shape=n_products)
    a = pm.Deterministic("a", mu_a + a_offset)  # * sigma_a)

    b_offset = pm.Normal('b_offset', mu=0, sd=1, shape=n_products)
    b = pm.Deterministic("b", mu_b + b_offset)  # * sigma_b)

    # Model error
    eps = pm.HalfCauchy('eps', 5)

    sales_est = a[data.product_id.values] + b[data.product_id.values] * data.log_price.values

    # Data likelihood
    sales_like = pm.Normal('sales_like', mu=sales_est, sd=eps, observed=data.log_sales)

    hierarchical_non_centered_trace = pm.sample(chains=2, draws=2000, tune=1000, target_accept=0.9)
``````

`sigma` is not always the catch-all for uncertainty in every model, as models can get very complex, but in most cases, starting with Normally distributed models, it can indeed be seen as the final catch-all of the model’s uncertainty.

Maybe this will help you. Take a look at this code that I’ve quickly put together.

``````
import numpy as np
import pymc3 as pm

## Both small datasets have the same mean, 3, and the same underlying generating process
data1 = np.random.normal(3, 1, 3)
data2 = np.random.normal(3, 1, 15)

with pm.Model() as Uncertainty_modeller_1:

    mu = pm.Normal('mu', mu=0, sigma=10)  # I chose a rather narrow prior, to make life easier for the sampler
    sigma = pm.HalfFlat('sigma')

    likelihood = pm.Normal('likelihood', mu=mu, sigma=sigma, observed=data1)

    trace1 = pm.sample(draws=3000, tune=3000, target_accept=0.99)

## Same model as above, but using the 2nd dataset, which imparts much more certainty, and thus a smaller sigma
with pm.Model() as Uncertainty_modeller_2:

    mu = pm.Normal('mu', mu=0, sigma=10)
    sigma = pm.HalfFlat('sigma')

    likelihood = pm.Normal('likelihood', mu=mu, sigma=sigma, observed=data2)

    trace2 = pm.sample(draws=3000, tune=3000, target_accept=0.99)
``````

Now, look at how the estimated sigma is different from the first model to the second, where it’s much smaller (and closer to the true value of `1`) because there is much more data to work with.

Moving from my example back to your code: removing sigma means you do not care about the sigma of the model you are estimating, only the mean. If you only want to figure out the mean, that may be valid, but most of the time one intuitively cares about the second and only other parameter of the Normal distribution, sigma.

Here’s an example of not caring about sigma, modifying the above models accordingly, and seeing the results.

``````
with pm.Model() as Uncertainty_modeller_1:

    mu = pm.Normal('mu', mu=0, sigma=10)
    # Going to now set the sigma arbitrarily as 0.001 below
    # sigma = pm.HalfFlat('sigma')
    likelihood = pm.Normal('likelihood', mu=mu, sigma=0.001, observed=data1)

    trace1 = pm.sample(draws=3000, tune=3000, target_accept=0.99)

## Same model as above, but using the 2nd dataset, which imparts much more certainty, and thus a smaller sigma
with pm.Model() as Uncertainty_modeller_2:

    mu = pm.Normal('mu', mu=0, sigma=10)
    # sigma = pm.HalfFlat('sigma')

    likelihood = pm.Normal('likelihood', mu=mu, sigma=0.001, observed=data2)

    trace2 = pm.sample(draws=3000, tune=3000, target_accept=0.99)
``````

The mean is still generally accurately calculated as `~3`, but no sigma is reported anymore, which is all that changes.

(Tell me if my mangled explanation helps understand things a bit more)

Thanks for the effort with the code! It often makes things clearer. In this case it’s easy to see what the sigma adds, but I’m not sure it’s exactly the same as in my parametrization. I think I might have been a bit unclear in my original question (I’ll update it for future readers), probably because I’m not sure exactly what I’m asking.

Note that the model is parametrized as:
`b = mu_b + b_offset * sigma_b`
And analogous for `a`. In this case it seems like `b_offset` and `sigma_b` are tightly coupled. Say that the offset should be 1; that can be modelled either as `b_offset = 1, sigma_b = 1`, or as `b_offset = 1000, sigma_b = 0.001`. This looks like an identifiability problem, and I don’t see what in the model makes `sigma_b` higher, or closer to 1, as we gain more confidence.

Edit: An alternative parametrization which is more intuitive to me would be to add the sigma as standard deviation on the offset, but that also leads to divergences.

I think I’m starting to figure this out now.
The prior for the `offset` is set to N(0, 1). Note the fixed standard deviation. This is multiplied by `sigma`, and `sigma * N(0, 1) = N(0, sigma^2)`. This means that the example and explanation by Gon_F were more relevant than I first realized. The model now has the freedom to adjust the standard deviation of the distribution on `offset`, and that’s the mechanism that lets `sigma` indicate the confidence in this particular parametrization.
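That identity is easy to sanity-check with plain numpy (seed chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.5

# Scaling standard-normal draws by sigma yields draws with standard deviation sigma,
# i.e. sigma * N(0, 1) has the same distribution as N(0, sigma^2):
scaled = sigma * rng.standard_normal(200_000)
print(scaled.mean(), scaled.std())  # mean near 0, standard deviation near 2.5
```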

Another issue with the model is that all X-values are positive, meaning that whenever `beta` decreases, the intercept needs to increase. By centering the X-values I get fewer divergences, but it doesn’t solve the entire problem. It seems to work slightly better to center across the entire dataset rather than by group.
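For concreteness, the two ways of centering look like this (a minimal sketch with made-up rows; column names as in the example above, and `log_price_c` / `log_price_gc` are just names I picked):

```python
import pandas as pd

data = pd.DataFrame({
    'log_price': [2.30258509, 2.39789527, 1.79175947, 2.07944154],
    'product_id': [0, 0, 1, 1],
})

# Center across the entire dataset: subtract the single global mean.
data['log_price_c'] = data['log_price'] - data['log_price'].mean()

# Center by group: subtract each product's own mean instead.
group_means = data.groupby('product_id')['log_price'].transform('mean')
data['log_price_gc'] = data['log_price'] - group_means
```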

So the remaining question is if there is any good way of handling this and keeping `sigma` in there? The main problem seems to occur when `sigma` gets very close to zero. It helps to make `sigma` higher by tweaking the prior, but that makes the model biased instead.

Hi Micael,
If I understood correctly, I think you’re asking what’s the difference between and role of `mu_a` and `sigma_a`. I’d say this is the core of hierarchical models. In short:

• `mu_a` estimates the population mean, i.e. the mean across clusters – here the products.
• `sigma_a` estimates the variation in the population, i.e. the variation across clusters – here, to what extent do all product types look alike? The more alike the clusters, the smaller `sigma_a` – which means all product-level parameters will be close to the population mean, `mu_a`, a phenomenon called shrinkage.

Here your clusters are product types, but they can be any non-ordered, discrete entity (individuals, States, years, cafés…). The idea behind it is that all products are different, but not completely different, so learning about one tells you something about another. So it makes sense to estimate the population level (all product types) at the same time as you estimate the cluster level (each product type) – which is why it’s also called a multilevel model.
The non-centered parametrization is useful for the sampler but is usually uglier in code and hence harder to understand. Here is what your model would look like in a centered version:

``````
with pm.Model() as hierarchical_model_centered:
    # Population level:
    a = pm.Normal('a', mu=0., sd=10.)  # mean intercept in population
    sigma_a = pm.Exponential('sigma_a', 1.)  # variation in intercepts across product types
    b = pm.Normal('b', mu=0., sd=1.5)
    sigma_b = pm.Exponential('sigma_b', 1.)

    # Cluster level:
    a_product = pm.Normal('a_product', mu=a, sd=sigma_a, shape=n_products)
    b_product = pm.Normal('b_product', mu=b, sd=sigma_b, shape=n_products)

    sales_est = a_product[data.product_id.values] + b_product[data.product_id.values] * data.log_price.values

    # Variation within product:
    eps = pm.Exponential('eps', 1.)

    # Data likelihood
    sales_like = pm.Normal('sales_like', mu=sales_est, sd=eps, observed=data.log_sales)
``````

This type of model is very powerful – both scientifically and for prediction – but it is tricky to fit and interpret, so I can’t cover everything here; instead I’ll point to some resources at the end of my answer.

But first, some thoughts, that I think will be useful:

• You have only two clusters here (`n_products` = 2). Even with a lot of data, I think that’s not enough for a hierarchical model: you’re trying to estimate a standard deviation (`sigma_a`) from only two data points (your two types of products). This is probably one of the reasons you’re getting divergences.
• I think the priors are too wide. With hierarchical models you need much more regularizing priors. I gave you (usually) good defaults above, but you should do prior predictive checks to fine-tune them to your case.
• I saw you standardized your predictor and your outcome – that’s very good practice and will often help the sampler, especially for complex models like multilevel ones.

Finally, here are some useful references:

• A port to PyMC3 of McElreath’s Rethinking chapter 12, dedicated to hierarchical models. And the accompanying problem set. Of course, it’s even more helpful when you read the book – which is a masterpiece!
• This thread, which is very close to this one.
• Another thread, more geared towards good practices.

Hope this helps and PyMCheers

And some progress with my model: Since at least some problems occur when `sigma` gets extremely close to zero, I tried using the alternative parametrization for the normal distribution with `tau` instead of `sigma`. This gets rid of almost all divergences. Though the trace looks a bit funny with occasional spikes where `tau` goes to 4000 or something, while it’s normally in the (10, 20) range. Not sure if this is a problem.
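For reference, `pm.Normal` in PyMC3 accepts `tau` (the precision) as an alternative to `sd`, where `tau = 1 / sigma**2`, so a `sigma` very close to zero corresponds to a very large `tau` rather than a hard boundary at 0. A quick numpy check of the conversion:

```python
import numpy as np

# tau is the precision of a normal distribution: tau = 1 / sigma**2.
# A sigma near zero maps to a very large tau.
sigma = np.array([0.01, 0.1, 1.0, 10.0])
tau = 1.0 / sigma**2

# Round-trip back to sigma recovers the original values:
sigma_back = 1.0 / np.sqrt(tau)
print(np.allclose(sigma_back, sigma))
```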