I’d add that hierarchical models can helpfully let you express more of your knowledge about the data-generating process. This can get quite elaborate and tbh is where I initially learned the principle, e.g. the classic example of determining whether a coin is fair, where the model includes a nested hierarchy of details about where the coin was minted (assembly line, factory, region), etc.
To @ricardoV94’s excellent point about constraining variance, I think this is a very cool and subtle property of hierarchical models and worth a little more explanation, so I’ll try.
Imagine you have a really simple linear model `y ~ N(a + bx, e)`. You have a categorical (aka factor) predictor variable `v` with `h` levels that you haven’t yet included. Where can you add `v`?
- If (as my notation suggests) you currently multiply coefficient `a` by 1, then this is a simple pooled intercept that already contains some information about `v`. However, that information is also mixed up with the rest of the model, so it’s not particularly useful.
- You could choose to place `v` on the intercept as `a[h]`, which gives you an unpooled intercept: a separate intercept for each of the `h` levels. (FYI this is the same as one-hot-encoding `v` into several new binary features.) There’s a rough sketch of this unpooled version just after this list.
- I’ll skip anything more elaborate for now.
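
To make that concrete, here’s a rough PyMC sketch of the unpooled version. The variable names, priors, and simulated data are all made up for illustration; I’ve deliberately simulated an imbalanced factor because it becomes relevant below:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
n = 1000

# simulate a three-level factor v with a deliberately imbalanced split
# (roughly 1% / 29% / 70%) -- this imbalance matters in the next paragraph
v_idx = rng.choice(3, size=n, p=[0.01, 0.29, 0.70])
x = rng.normal(size=n)
true_a = np.array([1.0, 0.0, -1.0])  # one "true" intercept per level
y = true_a[v_idx] + 0.5 * x + rng.normal(scale=0.3, size=n)

coords = {"level": ["h1", "h2", "h3"]}
with pm.Model(coords=coords) as unpooled:
    a = pm.Normal("a", mu=0, sigma=5, dims="level")  # separate, independent intercept per level
    b = pm.Normal("b", mu=0, sigma=5)
    e = pm.HalfNormal("e", sigma=1)
    pm.Normal("y_obs", mu=a[v_idx] + b * x, sigma=e, observed=y)
    idata = pm.sample()
```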
So now your model tells you something about how each level `h` of variable `v` correlates with `y`, which is great. However (and to elaborate on @ricardoV94’s point) you might have an imbalance of factor values in `v`: e.g. level `h_1` is present in 1% of the observations, `h_2` in 29%, and `h_3` in 70%. In this unpooled model the coefficients `a_1`, `a_2`, `a_3` are fitted completely independently, so the coefficient for `a_1` will be much weaker (have more variance) than `a_2` and `a_3`.
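
Continuing the sketch above (same `idata`), you can see this directly in the posterior summary: the sd for `a[h1]` should come out much wider than for the other two levels, simply because only ~1% of the rows inform it.

```python
import arviz as az

# posterior mean and sd per level intercept; with only ~10 observations
# in h1, its posterior sd is noticeably larger than for h2 and h3
az.summary(idata, var_names=["a"], kind="stats")
```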
In reality this factor-value imbalance might be misleading, because the levels aren’t necessarily orthogonal: there might be some shared information amongst them, and we could try to constrain the variance in the under-observed level by using the other levels.
To do this we could introduce a hierarchy onto this intercept to achieve partial pooling: `a[h] ~ N(w, 1)`, where `w` is a shared hyper-parameter with its own prior. Now `w` will fit to a balanced mean of the levels `h_*`, and each coefficient `a_*` will shrink towards `w` by an amount that naturally depends on its proportion in the data: `a_1` (the rarest level) will move a lot closer to `w`, `a_2` a little, and `a_3` hardly at all. The variance in `a_2`, and especially in `a_1`, would also reduce, giving us a more robust estimate or ‘sharing power’ between the factor-value levels.
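
In PyMC terms that partial pooling is just the same model with a hyperprior on the level intercepts. A rough sketch, reusing the simulated data and `coords` from the unpooled example above (again, names and priors are only illustrative):

```python
with pm.Model(coords=coords) as partial_pooled:
    w = pm.Normal("w", mu=0, sigma=5)            # shared hyper-mean of the level intercepts
    sigma_a = pm.HalfNormal("sigma_a", sigma=1)  # how tightly the levels cluster around w
    a = pm.Normal("a", mu=w, sigma=sigma_a, dims="level")  # partially pooled intercepts
    b = pm.Normal("b", mu=0, sigma=5)
    e = pm.HalfNormal("e", sigma=1)
    pm.Normal("y_obs", mu=a[v_idx] + b * x, sigma=e, observed=y)
    idata_pp = pm.sample(target_accept=0.9)
```

Comparing `az.summary(idata_pp, var_names=["a"])` against the unpooled version should show the `a[h1]` posterior pulled towards `w` and noticeably narrower. If the sampler complains (divergences, funnel-like geometry), a non-centered parameterisation of `a` is the usual first thing to try.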
As always in this game, there are no hard-and-fast rules about building these hierarchies, or even about whether they’re worthwhile for your particular dataset, and they can have very strange effects on the joint posterior and thus on the sampling. In some circumstances, though, they’re a really powerful tool.