Multilevel model specification

I’m trying to use a hierarchical model to analyze my data, but I’m running into a lot of divergences.
I’m basing the model specification on a blog post by Thomas Wiecki (https://twiecki.io/blog/2017/02/08/bayesian-hierchical-non-centered/). Also see a full working example of my use case below.

I don’t really understand the reason for sigma_a and sigma_b in this specification. The blog post says they can indicate confidence in the offsets from the mean, but I think that is also captured by the distribution on the offsets themselves. Further, it seems to me that they will correlate with a_offset and b_offset.
For example, sigma_a could increase a lot as long as a_offset shrinks towards 0 at the same time. Looking at the traces, the sigmas also seem to show a funnel tendency similar to the one described in the blog post.
If I remove the sigmas from the model specification, the problem with divergences disappears, so this seems to be at least part of the problem. What are sigma_a and sigma_b useful for, and what am I giving up by removing them?

EDIT FOR CLARIFICATION:
Note that the model is parametrized as:
b = mu_b + b_offset * sigma_b
And analogously for a. In this case it seems like b_offset and sigma_b are tightly coupled. Say the offset should be 1; that can be modelled either as b_offset = 1, sigma_b = 1, or as b_offset = 1000, sigma_b = 0.001. This looks like an identifiability problem, and I don’t see what in the model makes sigma_b larger or closer to 1 as we gain more confidence.

Full code example:

import pandas as pd
import pymc3 as pm 

data = pd.DataFrame({
    'log_sales': [1.94591015, 2.39789527, 1.94591015, 2.39789527, 1.79175947, 2.48490665,
         1.94591015, 2.30258509, 1.94591015, 2.77258872, 1.60943791, 1.94591015,
         1.38629436, 1.38629436, 0.69314718, 1.09861229],
    'log_price': [2.30258509, 2.30258509, 2.30258509, 2.30258509, 2.39789527, 2.39789527,
             2.39789527, 2.39789527, 1.79175947, 1.79175947, 1.79175947, 1.79175947,
             2.07944154, 2.07944154, 2.07944154, 2.07944154],
    'product_id': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
})
n_products = 2


with pm.Model() as hierarchical_model_non_centered:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Transformed:
    a_offset = pm.Normal('a_offset', mu=0, sd=1, shape=n_products)
    a = pm.Deterministic("a", mu_a + a_offset * sigma_a)

    b_offset = pm.Normal('b_offset', mu=0, sd=1, shape=n_products)
    b = pm.Deterministic("b", mu_b + b_offset * sigma_b)

    # Model error
    eps = pm.HalfCauchy('eps', 5)
    
    sales_est = a[data.product_id.values] + b[data.product_id.values] * data.log_price.values
    
    # Data likelihood
    sales_like = pm.Normal('sales_like', mu=sales_est, sd=eps, observed=data.log_sales)
    
    hierarchical_non_centered_trace = pm.sample(chains=2, draws=2000, tune=1000, target_accept=0.9)

Hey, micke. I have thought about this before, i.e. what sigma and mu separately mean within a single prior distribution in a larger model, and the answer I have come to is that while the latter tries to get you the right answer, the former accounts for the uncertainty that is “inherent” to all estimation.

Even OLS, the simplest of models, always attaches a confidence to the estimates it spits back. So do you think your model here will fundamentally be working correctly if you remove either sigma_a or sigma_b?

What does your code look like when you remove them? I would also try playing around with both the prior and sampler settings.
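
For instance, a minimal sketch of what I mean by playing with the sampler settings (the numbers and the trace name are just illustrative, not tuned for your data):

with hierarchical_model_non_centered:
    # Longer tuning and a higher target_accept often reduce divergences,
    # at the cost of slower sampling.
    trace_more_careful = pm.sample(chains=2, draws=2000, tune=4000, target_accept=0.95)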

Thanks for your answer. So if I’m understanding your explanation correctly, it sounds like some kind of catch-all for uncertainty that may be in the estimation? I’ll need to think about how that actually manifests in the sampling here, because I’m still not getting how the uncertainty can be modelled like that: what effect makes the sigma lower if we are more uncertain, and wouldn’t mu just increase as a reaction then?

Whether my model works correctly when removing the sigmas is what I’m trying to figure out, and I’m not sure exactly how to go about that, hence my question about their relevance.

I need to remove both of the sigmas to avoid the divergences and also have to increase target_accept to 0.9. This is what the code looks like then. I’ve just commented out the parts that are removed to make it easy to compare:

import pandas as pd
import pymc3 as pm 

data = pd.DataFrame({
    'log_sales': [1.94591015, 2.39789527, 1.94591015, 2.39789527, 1.79175947, 2.48490665,
         1.94591015, 2.30258509, 1.94591015, 2.77258872, 1.60943791, 1.94591015,
         1.38629436, 1.38629436, 0.69314718, 1.09861229],
    'log_price': [2.30258509, 2.30258509, 2.30258509, 2.30258509, 2.39789527, 2.39789527,
             2.39789527, 2.39789527, 1.79175947, 1.79175947, 1.79175947, 1.79175947,
             2.07944154, 2.07944154, 2.07944154, 2.07944154],
    'product_id': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
})
n_products = 2


with pm.Model() as hierarchical_model_non_centered:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    #sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    #sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Transformed:
    a_offset = pm.Normal('a_offset', mu=0, sd=1, shape=n_products)
    a = pm.Deterministic("a", mu_a + a_offset) # * sigma_a)

    b_offset = pm.Normal('b_offset', mu=0, sd=1, shape=n_products)
    b = pm.Deterministic("b", mu_b + b_offset) # * sigma_b)

    # Model error
    eps = pm.HalfCauchy('eps', 5)
    
    sales_est = a[data.product_id.values] + b[data.product_id.values] * data.log_price.values
    
    # Data likelihood
    sales_like = pm.Normal('sales_like', mu=sales_est, sd=eps, observed=data.log_sales)
    
    hierarchical_non_centered_trace = pm.sample(chains=2, draws=2000, tune=1000, target_accept=0.9)

sigma is not always the catch-all for uncertainty, as models can get very complex, but in most cases, starting with normally distributed models, it can indeed be seen as the final catch-all for the model’s uncertainty.

Maybe this will help you. Take a look at this code that I’ve quickly thrown together.

import numpy as np
import pymc3 as pm

## Both small datasets have the same mean, 3, and the same underlying generating process
data1 = np.random.normal(3, 1, 3)
data2 = np.random.normal(3, 1, 15)

with pm.Model() as Uncertainty_modeller_1:
    
    mu = pm.Normal('mu', mu = 0, sigma = 10) # I chose a rather narrow prior, to make life easier for the sampler
    sigma = pm.HalfFlat('sigma')

    likelihood = pm.Normal('likelihood', mu = mu, sigma = sigma, observed = data1)
    
    trace1 = pm.sample(draws=3000, tune=3000, target_accept=0.99)

    
## Same model as above, but using 2nd dataset, which imparts much more certainty, and thus a smaller sigma
with pm.Model() as Uncertainty_modeller_2:
    
    mu = pm.Normal('mu', mu = 0, sigma = 10)
    sigma = pm.HalfFlat('sigma')
    
    likelihood = pm.Normal('likelihood', mu = mu, sigma = sigma, observed = data2)
    
    trace2 = pm.sample(draws=3000, tune=3000, target_accept=0.99)


Now, look at how the estimated sigma is different from the first model to the second, where it’s much smaller (and closer to the true value of 1) because there is much more data to work with.
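
If it helps, here’s one way to put the two fits side by side (a sketch, assuming both models above have been run in the same session):

print(pm.summary(trace1))  # fit on 3 points: sigma estimate is larger and much less certain
print(pm.summary(trace2))  # fit on 15 points: sigma estimate is smaller and closer to 1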

Coming back from my example to your code now: removing sigma means you do not care about the sigma of the model you are calculating, only the mean. If you only want to figure out the mean, that may be valid, but most of the time one intuitively cares about the second and only other parameter of the Normal distribution, sigma.

Here’s an example of not caring about sigma: modify the above models accordingly and look at the results.

with pm.Model() as Uncertainty_modeller_1:
    
    mu = pm.Normal('mu', mu = 0, sigma = 10)
    # Going to now set the sigma arbitrarily as 0.001 below
    # sigma = pm.HalfFlat('sigma')
    likelihood = pm.Normal('likelihood', mu = mu, sigma = 0.001, observed = data1)
    
    trace1 = pm.sample(draws=3000, tune=3000, target_accept=0.99)

    
## Same model as above, but using the 2nd dataset; sigma is again fixed rather than estimated
with pm.Model() as Uncertainty_modeller_2:
    
    mu = pm.Normal('mu', mu = 0, sigma = 10)
    # sigma = pm.HalfFlat('sigma')

    likelihood = pm.Normal('likelihood', mu = mu, sigma = 0.001, observed = data2)
    
    trace2 = pm.sample(draws=3000, tune=3000, target_accept=0.99)


The mean is still generally accurately calculated as ~3, but no sigma is reported anymore, which is all that changes.

(Tell me if my mangled explanation helps you understand things a bit more.)

Thanks for the effort with the code! It often makes things clearer. In this case it’s easy to see what the sigma adds, but I’m not sure it’s exactly the same as in my parametrization. I think I might have been a bit unclear in my original question (I’ll update it for future readers), probably because I’m not sure exactly what I’m asking :slight_smile: .

Note that the model is parametrized as:
b = mu_b + b_offset * sigma_b
And analogously for a. In this case it seems like b_offset and sigma_b are tightly coupled. Say the offset should be 1; that can be modelled either as b_offset = 1, sigma_b = 1, or as b_offset = 1000, sigma_b = 0.001. This looks like an identifiability problem, and I don’t see what in the model makes sigma_b larger or closer to 1 as we gain more confidence.

Edit: An alternative parametrization that is more intuitive to me would be to use sigma directly as the standard deviation of the offset’s prior, but that also leads to divergences.
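
Roughly, what I mean is something like this (a sketch of the idea, not the exact code I ran; only the parts for b are shown):

import pymc3 as pm

n_products = 2  # as in the example above

with pm.Model() as alternative_parametrization:
    mu_b = pm.Normal('mu_b', mu=0., sd=10.)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # sigma_b is now the standard deviation of the offset's prior,
    # instead of a multiplier on a fixed N(0, 1) offset.
    b_offset = pm.Normal('b_offset', mu=0., sd=sigma_b, shape=n_products)
    b = pm.Deterministic('b', mu_b + b_offset)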

I think I’m starting to figure this out now.
The prior for the offset is set to N(0, 1). Note the fixed standard deviation. This is then multiplied by sigma, and sigma * N(0, 1) = N(0, sigma^2), i.e. a normal with standard deviation sigma. This means the example and explanation by Gon_F were more relevant than I first realized. So the model does have the freedom to adjust the standard deviation of the distribution on the offset, and that’s the mechanism that lets sigma indicate the confidence in this particular parametrization.
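
A quick numerical check of that equivalence (the values are picked arbitrarily, just to illustrate):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

# Non-centered: a standard normal scaled by sigma and shifted by mu...
non_centered = mu + sigma * rng.normal(0., 1., size=100_000)
# ...has the same distribution as sampling N(mu, sigma) directly (centered).
centered = rng.normal(mu, sigma, size=100_000)

print(non_centered.mean(), non_centered.std())  # both roughly 2.0 and 0.5
print(centered.mean(), centered.std())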

Another issue with the model is that all X-values are positive, meaning that whenever beta decreases, the intercept needs to increase. By centering the X-values I get fewer divergences, but it doesn’t solve the entire problem. It seems to work slightly better to center across the entire dataset rather than by group, as in the snippet below.
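
For reference, this is the kind of centering I mean (the new column names are just for illustration):

# Center log_price across the entire dataset:
data['log_price_c'] = data.log_price - data.log_price.mean()

# Or center within each product group instead:
data['log_price_c_group'] = data.log_price - data.groupby('product_id').log_price.transform('mean')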

So the remaining question is whether there is any good way of handling this while keeping sigma in the model. The main problem seems to occur when sigma gets very close to zero. It helps to push sigma higher by tweaking the prior, but that makes the model biased instead.

Hi Micael,
If I understood correctly, I think you’re asking what the difference between mu_a and sigma_a is, and what role each one plays. I’d say this is the core of hierarchical models. In short:

  • mu_a estimates the population mean, i.e. the mean across clusters – here, the products.
  • sigma_a estimates the variation in the population, i.e. the variation across clusters – here, to what extent do all product types look alike? The more alike the clusters, the smaller sigma_a – which means all product-level parameters will be close to the population mean, mu_a, a phenomenon called shrinkage.

Here your clusters are product types, but they can be any non-ordered, discrete entities (individuals, States, years, cafés…). The idea behind it is that all products are different, but not completely different, so learning about one tells you something about another. So it makes sense to estimate the population level (all product types) at the same time as you estimate the cluster level (each product type) – which is why it’s also called a multilevel model.
The non-centered parametrization is useful for the sampler but is usually uglier in code and hence harder to understand. Here is what your model would look like in a centered version:

with pm.Model() as hierarchical_model_centered:
    # Population-level:
    a = pm.Normal('a', mu=0., sd=10.) # mean intercept in population
    sigma_a = pm.Exponential('sigma_a', 1.) # variation in intercepts across product types
    b = pm.Normal('b', mu=0., sd=1.5)
    sigma_b = pm.Exponential('sigma_b', 1.)

    # Cluster-level:
    a_product = pm.Normal('a_product', mu=a, sd=sigma_a, shape=n_products)
    b_product = pm.Normal('b_product', mu=b, sd=sigma_b, shape=n_products)
    
    sales_est = a_product[data.product_id.values] + b_product[data.product_id.values] * data.log_price.values
    # Variation within product:
    eps = pm.Exponential('eps', 1.)
    
    # Data likelihood
    sales_like = pm.Normal('sales_like', mu=sales_est, sd=eps, observed=data.log_sales)
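
    # If you want to sample this version right away, the same call as in your
    # original code should work (these settings are just a starting point):
    hierarchical_centered_trace = pm.sample(chains=2, draws=2000, tune=1000, target_accept=0.9)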

This type of model is very powerful – both scientifically and for prediction – but these models are tricky to fit and interpret, so I can’t explain everything here; instead, I’ll point to some resources at the end of my answer.

But first, some thoughts, that I think will be useful:

  • You have only two clusters here (n_products = 2). Even with a lot of data, I think that’s not enough for a hierarchical model: you’re trying to estimate a standard deviation (sigma_a) with only two data points (your two types of products). This is probably one of the reasons you’re getting divergences.
  • I think the priors are too wide. With hierarchical models you need much more regularizing priors. I gave you (usually) good defaults above, but you should do prior predictive checks to fine-tune them to your case (see the sketch after this list).
  • I saw you standardized your predictor and your outcome – that’s very good practice and will often help the sampler, especially for complex models like multilevel ones.
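
A minimal sketch of such a prior predictive check for the centered model above (pm.sample_prior_predictive returns a dict of draws; plotting a histogram is just one way to see whether the implied sales are plausible):

import matplotlib.pyplot as plt

with hierarchical_model_centered:
    prior_checks = pm.sample_prior_predictive(samples=500)

plt.hist(prior_checks['sales_like'].flatten(), bins=50)
plt.xlabel('prior predictive log_sales')
plt.show()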

Finally, here are some useful references:

  • A port to PyMC3 of McElreath’s Rethinking chapter 12, dedicated to hierarchical models. And the accompanying problem set. Of course, it’s even more helpful when you read the book – which is a masterpiece!
  • This thread, which is very close to this one.
  • Another thread, more geared towards good practices.

Hope this helps and PyMCheers :vulcan_salute:

Hi Alex, thanks for your answer!
I’ve quickly had a look through the linked resources but will go through them more carefully as soon as I have time.
For your specific points:

  • Number of clusters/products: I actually have thousands and plan to split them into groups of related products. For this post, I just used a very small dataset in order to highlight the issue I’m facing. I run into the same problem with more products.
  • Wide priors: Kind of the same thing here. For this thread I copy-pasted the code from the blog article I linked in the initial post, which had very wide priors. In my specific case, I want to use a rather narrow prior. Since I’m modelling the relationship between price and sales (price elasticity of demand), I already know it will be negative and typically below -1. For each product category I also have a good guess about the group mean, which may allow me to narrow the priors further. I will probably switch to another distribution, such as a negative Lognormal (see the sketch after this list), as I believe the elasticity will often be in the (-3, -2) range but can be much lower in a few cases.
  • Standardizing: Yes, thanks :slight_smile: if you have further tips on how to balance that in multilevel models I’m interested. It’s possible to standardize either per cluster or over the entire dataset, and I guess there are also other possible approaches, so I’d like to know a bit more about the best practice here.
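
For what it’s worth, a rough sketch of the negative Lognormal idea (the hyperparameters are placeholders, picked only so that most of the prior mass sits roughly in the -3 to -2 region):

import numpy as np
import pymc3 as pm

with pm.Model() as elasticity_prior_sketch:
    # Lognormal is strictly positive, so negate it to get a strictly negative elasticity.
    neg_b = pm.Lognormal('neg_b', mu=np.log(2.5), sd=0.3)
    b = pm.Deterministic('b', -neg_b)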

And some progress with my model: since at least some of the problems occur when sigma gets extremely close to zero, I tried using the alternative parametrization of the normal distribution with tau instead of sigma, roughly along the lines of the sketch below. This gets rid of almost all divergences, though the trace looks a bit funny with occasional spikes where tau goes to 4000 or so, while it’s normally in the (10, 20) range. Not sure if this is a problem.
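
A hypothetical sketch of what that tau parametrization could look like (the exact code isn’t shown here, and the Gamma prior on tau is just an assumption), again only for b:

import pymc3 as pm

n_products = 2

with pm.Model() as tau_parametrized:
    mu_b = pm.Normal('mu_b', mu=0., sd=10.)
    # Precision instead of standard deviation: tau = 1 / sigma**2
    tau_b = pm.Gamma('tau_b', alpha=2., beta=0.1)
    b_product = pm.Normal('b_product', mu=mu_b, tau=tau_b, shape=n_products)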
