I’d add that hierarchical models can helpfully let you express more of your knowledge about the data generating process. This can get quite elaborate and tbh is where I initially learned the principle. E.g. the classic model of determining whether a coin flip is fair, where the model includes a nested hierarchy of details about where the coin was minted (assembly line, factory, region), etc.
To @ricardoV94’s excellent point about constraining variance, I think this is a very cool and subtle property of hierarchical models and worth a little more explanation, so I’ll try.
Imagine you have a really simple linear model `y ~ N(a + b*x, e)`. You have a categorical (aka factor) predictor variable `v` with `h` levels that you haven’t yet included. Where can you add `v`?
- If (as my notation suggests) you currently multiply coefficient `a` by 1, then this is a simple pooled intercept that already contains some information about `v`. However, that information is also mixed up with the rest of the model, so it’s not particularly useful.
- You could choose to place `v` on the intercept as `a[h]`, which gives you an unpooled intercept: a separate intercept for each of the `h` levels. (FYI this is the same as one-hot-encoding `v` into several new binary features.)
- I’ll skip anything more elaborate for now
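To make the unpooled option concrete, here’s a tiny numpy sketch (with made-up data and a hypothetical 3-level factor) showing that indexing the intercept as `a[h]` gives exactly the same result as one-hot-encoding `v`:

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_levels = 8, 3
v = rng.integers(0, n_levels, size=n)   # factor v with h = 3 levels
a = np.array([0.5, -1.0, 2.0])          # one intercept per level (hypothetical values)

# Unpooled intercept via indexing: a[h]
intercept_indexed = a[v]

# Same thing via one-hot-encoding v into binary features
one_hot = np.eye(n_levels)[v]           # shape (n, 3), one column per level
intercept_onehot = one_hot @ a

assert np.allclose(intercept_indexed, intercept_onehot)
```

Same model, two parameterisations — which is why the unpooled intercept and the dummy-variable regression give identical fits.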
So now your model tells you something about how each level `h` of variable `v` correlates with `y`, which is great. However (and to elaborate on @ricardoV94’s point) you might have an imbalance of factor-values in `v`: e.g. level `h_1` is present in 1% of the observations, `h_2` in 29%, and `h_3` in 70%.
In this unpooled model, the coefficients `a_1, a_2, a_3` are fitted completely independently, and so the coefficient `a_1` will be much weaker (have more variance) than `a_3`.
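You can see that variance imbalance numerically with a quick simulation (numbers entirely made up): the standard error of an independently-fitted group mean scales like `1/sqrt(n_h)`, so the 1% level is estimated far more noisily than the 70% level:

```python
import numpy as np

rng = np.random.default_rng(42)

n_total = 10_000
props = [0.01, 0.29, 0.70]   # shares of h_1, h_2, h_3 in the observations
sigma = 1.0                  # common observation noise

ses = []
for p in props:
    n_h = int(n_total * p)
    y = rng.normal(0.0, sigma, size=n_h)
    # standard error of an independently-fitted group mean ~ sigma / sqrt(n_h)
    ses.append(y.std(ddof=1) / np.sqrt(n_h))

print([round(se, 3) for se in ses])  # se shrinks as the level gets more data
```

With only 100 observations, the `h_1` estimate’s standard error is several times larger than that of `h_3` — that’s the “weaker” coefficient above.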
In reality this factor-value imbalance might be misleading because the factors aren’t necessarily orthogonal: there might be some shared information amongst them, and we could try to constrain the variance in the under-observed factor-value by using the other factor values.
To do this we could introduce a hierarchy onto this intercept to achieve partial pooling: `a[h] ~ N(w, 1)`, where `w` will fit to a balanced mean of the levels `h_*`, and each coefficient `a_*` will shrink towards `w`, naturally, in proportion to how little data it sees. I.e. `a_2` will move a little closer to `w`, `a_1` will move a lot closer to `w`, and `a_3` (which dominates the data) will barely move. The variance in `a_2`, and especially in `a_1`, would also reduce, giving us a more robust estimate, or ‘sharing power’ between the factor-value levels.
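Here’s a rough numpy sketch of that shrinkage, using the closed-form normal–normal compromise rather than an actual fitted model (all the counts, estimates, and the stand-in for `w` below are hypothetical): each level’s partially-pooled estimate is a precision-weighted average of its own unpooled estimate and the shared mean, so the under-observed `a_1` moves the most:

```python
import numpy as np

# Hypothetical unpooled estimates for h_1 (1%), h_2 (29%), h_3 (70%)
counts = np.array([100, 2900, 7000])
a_unpooled = np.array([2.0, 0.5, 0.2])
sigma = 1.0   # observation noise sd
tau = 0.5     # group-level (hierarchy) sd

# Rough stand-in for the fitted group-level mean w
w = np.average(a_unpooled, weights=counts)

# Normal-normal partial pooling: precision-weighted compromise between
# each level's own estimate and the shared mean w
shrink = (counts / sigma**2) / (counts / sigma**2 + 1 / tau**2)
a_pooled = shrink * a_unpooled + (1 - shrink) * w

# The under-observed level h_1 shrinks the most towards w
moved = np.abs(a_pooled - a_unpooled)
assert moved[0] > moved[1] > moved[2]
```

The `shrink` factor is the key: with lots of data it approaches 1 (the level keeps its own estimate), and with little data it falls towards 0 (the level is pulled towards `w`). A fitted hierarchical model does this automatically through the joint posterior rather than this two-step calculation.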
As always in this game, there are no hard-and-fast rules about building these hierarchies, or even about whether they’re worthwhile for your particular dataset, and they can have very strange effects on the joint posterior and thus the sampling. But in the right circumstances it’s a really powerful tool.