Multilevel model specification

Hi Micael,
If I understood correctly, I think you’re asking what’s the difference between and role of mu_a and sigma_a. I’d say this is the core of hierarchical models. In short:

  • mu_a estimates the population mean, i.e the mean across clusters – here the products.
  • sigma_a estimates the variation in the population, i.e the variation across clusters – here, to what extent do all product types look alike? The more alike the clusters, the smaller sigma_a – which means all product-level parameters will be close to the population mean, mu_a, a phenomenon called shrinkage.

Here your clusters are product types, but it can be any non-ordered and discrete entity (individuals, States, years, cafés…). The idea behind it is that all products are different, but not completely different, so learning about one tells you something about another. So it makes sense to estimate the population-level (all product types) at the same time as you estimate the cluster-level (each product types) – which is why it’s also called a multilevel model.
The non-centered parametrization is useful for the sampler but is usually uglier in code and hence harder to understand. Here is what your model would look like in a centered version:

with pm.Model() as hierarchical_model_centered:
    # Population-level:
    a = pm.Normal('a', mu=0., sd=10.) # mean intercept in population
    sigma_a = pm.Exponential('sigma_a', 1.) # variation in intercepts across product types
    b = pm.Normal('b', mu=0., sd=1.5)
    sigma_b = pm.Exponential('sigma_b', 1.)

    # Cluster-level:
    a_product = pm.Normal('a_product', mu=a, sd=sigma_a, shape=n_products)
    b_product = pm.Normal('b_product', mu=b, sd=sigma_b, shape=n_products)
    
    sales_est = a_product[data.product_id.values] + b_product[data.product_id.values] * data.log_price.values
    # Variation within product:
    eps = pm.Exponential('eps', 1.)
    
    # Data likelihood
    sales_like = pm.Normal('sales_like', mu=sales_est, sd=eps, observed=data.log_sales)

This type of model is very powerful – both scientifically and for prediction – but they are tricky to fit and interpret, so they can’t be explained here, but I’ll point to some resources at the end of my answer.

But first, some thoughts, that I think will be useful:

  • You have only two clusters here (n_products = 2). Even with a lot of data I think it’s not enough to do a hierarchical model: you’re trying to estimate a standard deviation (sigma_a) with only two data points (your two types of products). This is probably one of the reasons you’re getting divergences
  • I think the priors are too wide. With hierarchical models you need much more regularizing priors. I gave you (usually) good defaults above, but you should do prior predictive checks to fine-tune them to your case.
  • I saw you standardized your predictor and your outcome – that’s very good practice and will often help the sampler, especially for complex models like multilevel ones.

Finally, here are some useful references:

  • A port to PyMC3 of McElreath’s Rethinking chapter 12, dedicated to hierarchical models. And the accompanying problem set. Of course, it’s even more helpful when you read the book – which is a masterpiece!
  • This thread, which is very close to this one.
  • Another thread, more geared towards good practices.

Hope this helps and PyMCheers :vulcan_salute: