Why are models hierarchical?

I’m getting back into PyMC3, and I’m looking for a simple, intuitive answer to a question that came up while revisiting the basics:

What are the reasons for making models hierarchical, and what are the advantages?

Sorry if the question is too simple.


It’s a very broad question. Here is a very simplistic take:

In general hierarchical models are used because you assume that one measurement (can) provide information for another measurement, even if they are not necessarily causally linked. For instance, knowing that five instances of variable X have a mean of 5 and a standard deviation of 2 can inform you about the expected values of the next observations of X, even if you have no particular information about how the instances of X are related to each other.

In contrast, non-hierarchical models start from scratch when making inferences about each instance of X. This means you need more data to reduce your uncertainty to the same extent a hierarchical model could.
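To make that concrete, here is a toy sketch (with made-up numbers) of how previously seen instances of X can inform a guess about the next one:

```python
import numpy as np

# Five hypothetical observations of X, constructed to have mean 5 and sd 2
x = np.array([3.0, 3.0, 5.0, 7.0, 7.0])
mu, sd = x.mean(), x.std(ddof=1)

# With no information, a guess for the next X could be anywhere.
# Having seen these five values, a plug-in predictive range is much tighter:
lo, hi = mu - 2 * sd, mu + 2 * sd
print(f"expect the next X roughly within [{lo}, {hi}]")  # [1.0, 9.0]
```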


I’d add that hierarchical models helpfully let you express more of your knowledge about the data-generating process. This can get quite elaborate and, tbh, is where I first learned the principle: e.g. the classic model of determining whether a coin flip is fair, which nests a hierarchy of details about where the coin was minted (assembly line, factory, region), etc.

To @ricardoV94’s excellent point about constraining variance, I think this is a very cool and subtle property of hierarchical models and worth a little more explanation, so I’ll try.

Imagine you have a really simple linear model y ~ N(a + bx, e). You have a categorical (aka factor) predictor variable v with h levels that you haven’t yet included. Where can you add v?

  1. If (as my notation suggests) you currently multiply coefficient a by 1, then this is a simple pooled intercept that already contains some information about v. However that information is also mixed up with the rest of the model, so it’s not particularly useful.
  2. You could choose to place v on the intercept as a[h], which gives you an unpooled intercept: a separate intercept for each of h levels. (FYI this is the same as one-hot-encoding v into several new binary features).
  3. I’ll skip anything more elaborate for now.
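A quick numpy check of the aside in (2), that indexing a vector of per-level intercepts is equivalent to one-hot-encoding v (the level codes and coefficient values here are made up):

```python
import numpy as np

# Hypothetical factor v with h=3 levels, coded 0..2
v = np.array([0, 1, 2, 1, 0])
a = np.array([1.5, -0.3, 2.0])  # one intercept per level (unpooled)

# Indexing the intercept vector by level...
intercept_by_index = a[v]

# ...gives the same result as one-hot-encoding v into binary features
# and multiplying by the coefficient vector
one_hot = np.eye(3)[v]
intercept_by_onehot = one_hot @ a

assert np.allclose(intercept_by_index, intercept_by_onehot)
```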

So now your model tells you something about how each level of variable v correlates with y, which is great. However (and to elaborate on @ricardoV94’s point) you might have an imbalance of factor-values in v: e.g. level h_1 is present in 1% of the observations, h_2 in 29%, and h_3 in 70%. In this unpooled model, the coefficients a_1, a_2, a_3 are fitted completely independently, and so the estimate of a_1 will be much weaker (have more variance) than those of a_2 and a_3.

In reality this factor-value imbalance might be misleading because the factors aren’t necessarily orthogonal: there might be some shared information amongst them, and we could try to constrain the variance in the under-observed factor-value by using the other factor values.

To do this we could introduce a hierarchy onto this intercept to achieve partial pooling: a[h] ~ N(w, 1), with w itself estimated. Now w will fit to a mean of the levels h_*, and each coefficient a_* will shrink towards w, naturally in proportion to how little data supports it: a_1, with the fewest observations, will move a lot closer to w, a_2 a little, and a_3 hardly at all. The variance in a_2 and especially in a_1 will also reduce, giving us more robust estimates, or ‘sharing power’ between the factor-value levels.
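This shrinkage behaviour can be sketched numerically without a sampler. Under a normal-normal model with (assumed) known observation noise sd sigma and group-level sd tau, the partially pooled estimate for each level is a precision-weighted compromise between that level’s own mean and the shared mean w, so the level with the fewest observations moves the most. All the numbers below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced groups: 1%, 29%, 70% of 1000 observations
n = np.array([10, 290, 700])
true_means = np.array([2.0, 0.0, -1.0])
sigma = 1.0   # assumed known observation noise sd
tau = 0.5     # assumed sd of the group intercepts around w

groups = [rng.normal(m, sigma, size=k) for m, k in zip(true_means, n)]
ybar = np.array([g.mean() for g in groups])  # unpooled per-group estimates
w = np.concatenate(groups).mean()            # rough stand-in for the pooled mean

# Normal-normal partial pooling: precision-weighted average of each
# group's own mean and the shared mean w
post = (n / sigma**2 * ybar + w / tau**2) / (n / sigma**2 + 1 / tau**2)

shrink = np.abs(post - ybar)  # how far each estimate moved towards w
# The group with the fewest observations shrinks the most
assert shrink[0] > shrink[1] > shrink[2]
```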

As always in this game, there’s no hard and fast rules about building these hierarchies, or even if they’re worthwhile for your particular dataset, and they can have very strange effects on the joint posterior and thus the sampling. In some circumstances it’s a really powerful tool.


I’m not as smart as these other guys, but here is my simple rule: if your data is clustered, a hierarchical model can make sense because errors within a cluster can be correlated, so it helps you avoid violating the independence assumptions of other statistical methods. Also, the shrinkage towards the mean can help avoid overly confident predictions.


Hi guys, I am also new to PyMC3, thanks for the thorough explanation! I have an additional question about hierarchical models. Let’s say I am trying to find the probability of occurrence of demand values for multiple retail products. In the dataset, each product has attributes such as color, brand and so on, and I want to make sure these product characteristics are used in the model when making predictions for all the products. Is this what you mean with:

" one measurement (can) provide information for another measurement, even if they are not necessarily causally linked."

I believe an example would help. Let’s say I want a probabilistic forecast of demand for two shirts, one blue and one red. The demand probability for one shirt should be linked to the demand for the other. Is this achievable with PyMC3?

Thank you!


Sure, perhaps the example notebooks will help: there are a few on hierarchical modelling.

Likely the core snippet you want to think about is something like:

    beta_mu = pm.Normal('beta_mu', mu=0, sigma=1)
    beta = pm.Normal('beta', mu=beta_mu, sigma=1, shape=n_colors)