Hello Bayesians!

Suppose I have data that I want to describe with partial pooling. I’m interested in both the global (population-level) and the individual estimates.

Further, suppose that one individual has simply contributed more observations than the others. For example, suppose the context is evaluating the performance boost that energy drinks give runners. Perhaps one runner (or a handful of runners) goes on far more runs than the rest, and so contributes far more data.

My question is: in such a scenario, where some individuals contribute more data than most others, do they have unfair pull on inference? (E.g., do they dominate the posterior and pull the estimates of all the other runners?)

Or are partial pooling / multilevel models robust against such data imbalance?

Why or why not? Thanks!!

Partial pooling is robust to that. You could observe one member’s parameter exactly (and therefore have zero uncertainty about it) and it would still count as a single datapoint for the group-level estimates.

Hi Ricardo, thanks!

Could you add a few more details on the why?

I’d need to defend why I think partial pooling makes sense to a technical audience, given the concern that individuals with more data points would overpower those with fewer data points.

In your example, one individual contributes one data point (zero uncertainty) and it is treated as one data point to the population parameter.

Let’s adjust this scenario: say this individual contributed 10 points. Would the population parameter treat these as 10 individual data points, or as just 1 (essentially weighted the same as every other individual)?

Usually, multiple datapoints per individual only inform the individual-level parameters directly. If you have many observations, you become increasingly certain what the individual parameters are, but you don’t become increasingly certain of what the group-level distribution looks like.
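To make that concrete, here is a sketch for the simplest case I can think of: a normal-normal model with known observation standard deviation and known group-level standard deviation. The symbols $y_{ij}$, $\theta_j$, $\mu$, $\sigma$, $\tau$ and $n_j$ are my own notation for illustration, not something from the posts above, and real models with unknown variances will behave only approximately like this.

```latex
% Assumed model: individual j contributes n_j observations
y_{ij} \mid \theta_j \sim \mathcal{N}(\theta_j, \sigma^2), \qquad
\theta_j \mid \mu \sim \mathcal{N}(\mu, \tau^2)

% Integrating out \theta_j, the sample mean of individual j satisfies
\bar{y}_j \mid \mu \sim \mathcal{N}\!\left(\mu,\; \tau^2 + \frac{\sigma^2}{n_j}\right)

% Precision that individual j contributes about \mu
w_j = \frac{1}{\tau^2 + \sigma^2 / n_j} \;\le\; \frac{1}{\tau^2}
```

Under these assumptions, more observations shrink the $\sigma^2 / n_j$ term, but the $\tau^2$ term never goes away, so the information a single individual carries about $\mu$ is capped at $1/\tau^2$ however large $n_j$ gets.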

If I tell you Team A scored precisely 10 goals per game on average last season, you don’t conclude every other team also scored 10 goals. You believe 10 is a slightly more plausible average value for unseen teams, but not much more than if I told you Team A scored between 9 and 10 goals on average, but I don’t remember more precisely. You will also think anything between 0 and 20 is still quite likely for other teams (or whatever your prior is for this sport), but maybe 100 is now a bit unlikely.

The observations from Team A only translate to other teams via the team mean, and that’s a single parameter, regardless of whether Team A played 1 game or 1000 games.

At least, that’s how it works given the way hierarchical models are usually set up.
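If it helps when defending this to a technical audience, here is a minimal sketch of how much each individual's average would count toward the population mean. It assumes a normal-normal model with known observation standard deviation `sigma`, known group-level standard deviation `tau`, and a flat prior on the population mean; in that case each individual's sample mean is weighted by 1 / (tau^2 + sigma^2 / n). The function name and the numbers are hypothetical, chosen only for illustration.

```python
def population_mean_weights(n_obs, sigma, tau):
    """Normalized weight of each individual's sample mean in the estimate
    of the population mean.

    Assumes a normal-normal hierarchical model with known sigma and tau
    and a flat prior on the population mean (an illustrative simplification).
    """
    # Precision each individual contributes about the population mean.
    precisions = [1.0 / (tau**2 + sigma**2 / n) for n in n_obs]
    total = sum(precisions)
    return [p / total for p in precisions]


# Hypothetical data: one runner with 1000 runs, four runners with 10 runs each.
n_obs = [1000, 10, 10, 10, 10]

# Partial pooling weights (sigma and tau are assumed values).
weights = population_mean_weights(n_obs, sigma=1.0, tau=1.0)

# Complete pooling for comparison: every observation counts equally.
naive = [n / sum(n_obs) for n in n_obs]

print("partial pooling :", [round(w, 3) for w in weights])
print("complete pooling:", [round(w, 3) for w in naive])
```

With these assumed values, the heavy runner gets a weight of about 0.22 under partial pooling, versus about 0.96 if all runs were simply pooled together. One caveat: the cap depends on `tau`. If `tau` is very small relative to `sigma`, the weights move toward being proportional to the number of observations, which is the complete pooling limit.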

This lecture from Statistical Rethinking might be nice to look at. It walks through parameter updating in a hierarchical model visually, step by step, which helped me a lot in building intuition about what the models do.
