Partial pooling: Understanding sampling efficiency and "scale dependence"

I haven’t been able to fully wrap my head around the “simpson paradox” example yet. For the data generating process in the example it seems to be quite challenging to set up a straightforward (hierarchical) model that has no divergences/low ESS values for certain variables, cf. discussion in Simpsons paradox and mixed models - large number of groups - #16 by aseyboldt

In order to understand this I have tried to simplify the data generation process as follows. I am simply drawing (univariate) random normal variates having a group-specific mean \mu_g and a variance \sigma^2 (the same for all groups for the sake of simplicity). The group-specific mean is random normally distributed with location (mean) \mu and variance \omega^2.

In other words I have an intra-group variability \sigma and an inter-group variability \omega in my model.

To start with I set \mu=2, \sigma=0.75, and \omega=0.1, i.e. the inter-group variability (of the mean) is significantly smaller than the intra-group variability of observations (around the group mean).

For later experiments I can change num_groups, num_samples_per_group, and omega in my data generating process.

def generate_data(num_groups: int = 5, num_samples_per_group: int = 20, omega: float = 0.1) -> pd.DataFrame:
    rng = np.random.default_rng(seed=41)
    mu = 2.0
    # intra-group varability of observations, same for all groups
    sigma = 0.75
    # inter-group variability in location mu_group
    # omega = 0.1
    result = {"group": [], "observation": []}
    for i in range(num_groups):
        mu_group = rng.normal(loc=mu, scale=omega)
        result["group"] += [i] * num_samples_per_group
        result["observation"] += rng.normal(loc=mu_group, scale=sigma, size=num_samples_per_group).tolist()
    return pd.DataFrame(result)

The hierarchical model has population level variables \mu, \sigma, and \omega. It does not model a group specific variability of observations, i.e. \sigma_g. The group specific mean is modelled in the “standard” way (i.e. essentially copy & paste from the examples above).

def make_model_v1(frame: pd.DataFrame) -> pm.Model:
    """No group-specific volatility in the observation generating process, i.e. 'sigma_group = sigma_population'."""
    coords = {"group": frame["group"].unique()}
    with pm.Model(coords=coords) as model:
        # Data
        y = pm.MutableData("y", frame["observation"])
        g = pm.MutableData("g", frame["group"])

        # Population level parameters
        mu = pm.Normal("mu")
        sigma = pm.HalfNormal("sigma")
        omega = pm.HalfNormal("omega", 10.0)

        # Group level parameters
        epsilon_mu_g = pm.ZeroSumNormal("epsilon_mu_g", dims="group")
        mu_g = pm.Deterministic("mu_g", mu + omega * epsilon_mu_g)

        # Likelihood
        pm.Normal("y_obs", mu=mu_g[g], sigma=sigma, observed=y)
    return model

For omega=0.1 the model seems to behave perfectly reasonable, e.g. increasing num_groups “narrows” the posterior on the true values for \mu, \sigma, and \omega`, cf. table below

omega (true value) num_groups num_samples_per_group parameter mean hdi_3% hdi_97% ess_bulk ess_tail r_hat
0 0.1 5 20 mu 1.902 1.762 2.024 2382 1913 1
1 0.1 5 20 omega 0.238 0 0.587 883 980 1.01
2 0.1 5 20 sigma 0.706 0.607 0.801 2842 2585 1
3 0.1 50 20 mu 1.985 1.94 2.031 4798 2500 1
4 0.1 50 20 omega 0.071 0 0.137 1266 2026 1
5 0.1 50 20 sigma 0.772 0.739 0.806 4429 2337 1
6 0.1 500 20 mu 1.995 1.98 2.009 8185 2322 1
7 0.1 500 20 omega 0.096 0.07 0.119 1482 1985 1
8 0.1 500 20 sigma 0.751 0.74 0.761 5512 2852 1

One can also observe that the ESS increases with increasing number of groups. My interpretation is that the only evidence for omega is the between group variability and for a small number of groups there is just not much evidence that can be gathered. This in turn leads to a small ESS, especially for omega.

This is somewhat counterintuitive to the “there is no minimum sample size for Bayesian data analysis” statement since the presence of somewhat low ESS leaves me (as someone getting just started) doubtful that I will be able to trust my hierarchical model (due to problems with posterior sampling) in situations where I don’t control the data generating process. I guess I am overlooking something here though, but any input would be very welcome.

Now if I increase omega by a factor of ten and run the same experiment, then I observe that the posterior “hdi” does not contain the true value for 5 and 50 groups and the ESS values for omega drop significantly compared to the ones for the experiment with omega=0.1,

omega (true value) num_groups num_samples_per_group parameter mean hdi_3% hdi_97% ess_bulk ess_tail r_hat
0 1 5 20 mu 1.746 1.606 1.872 3210 1771 1
1 1 5 20 omega 1.112 0.413 2.077 660 841 1.01
2 1 5 20 sigma 0.706 0.61 0.802 2523 2146 1
3 1 50 20 mu 1.861 1.817 1.909 6323 2901 1
4 1 50 20 omega 0.922 0.737 1.117 501 670 1.01
5 1 50 20 sigma 0.773 0.741 0.807 4813 2549 1
6 1 500 20 mu 1.988 1.975 2.002 9388 2736 1
7 1 500 20 omega 0.991 0.93 1.051 264 665 1
8 1 500 20 sigma 0.75 0.741 0.761 7152 2465 1

For omega=10, the results look as follows

omega (true value) num_groups num_samples_per_group parameter mean hdi_3% hdi_97% ess_bulk ess_tail r_hat
0 10 5 20 mu 0.149 0.018 0.288 1905 1915 1
1 10 5 20 omega 8.656 3.897 13.736 594 879 1.01
2 10 5 20 sigma 0.705 0.615 0.802 1541 1921 1
3 10 50 20 mu 0.623 0.578 0.672 3588 2597 1
4 10 50 20 omega 8.89 7.173 10.57 132 127 1.03
5 10 50 20 sigma 0.772 0.741 0.804 3405 2963 1
6 10 500 20 mu 1.922 1.909 1.936 939 997 1.01
7 10 500 20 omega 9.888 9.433 10.931 5 12 2.11
8 10 500 20 sigma 0.75 0.741 0.761 414 853 1

I am not entirely sure what I should make of this. I could say that the standard error of the estimate for mu is essentially omega / sqrt(num_groups) and the deviation between the posterior mean and the true value seems to be compatible with that number. Also if omega is larger than sigma does it make sense to talk about a (group-level) random effect, there is not much that those groups are sharing (in terms of the average of the observations within each group).

I’d be glad for any suggestions how to reconcile this intuition with the observed posterior summaries.