Hierarchical model: Effect of global hyper prior

Dear Bayesians,

I am trying to understand the effect of a global hyper prior in a hierarchical model that simply estimates the p of a Bernoulli distribution.

My data generating process has 3 groups with different p’s: imagine three age groups with different conversion rates.

The groups also differ in size: two groups are large (n=100), one is smaller (n=20). The observed average conversion rates are:

0-49     0.29    # n=100
50-69    0.17    # n=100
70+      0.05    # n=20
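For reproducibility, here is a minimal sketch of how df_conversions could be generated (the true p’s are my assumptions, chosen to roughly match the observed rates; this snippet is not part of the original setup):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# (n, true p) per age group; the true p's are illustrative assumptions
groups = {"0-49": (100, 0.30), "50-69": (100, 0.17), "70+": (20, 0.05)}

df_conversions = pd.concat(
    [pd.DataFrame({"age_group": name, "conversion": rng.binomial(1, p_true, size=n)})
     for name, (n, p_true) in groups.items()],
    ignore_index=True,
)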

I want to compare how a hierarchical and a non-hierarchical model deal with these numbers.

Hierarchical model:

import pymc as pm

# Integer group index per observation, plus the sorted unique labels
age_group_idx, age_group_unique = pd.factorize(df_conversions["age_group"], sort=True)

COORDS = {"age_group": ["0-49", "50-69", "70+"]}

with pm.Model(coords=COORDS) as hierarchical_conversion_model:
    # Global hyper priors shared by all age groups
    alpha_hyper = pm.Gamma("alpha_hyper", alpha=2, beta=2)
    beta_hyper = pm.Gamma("beta_hyper", alpha=2, beta=2)

    # One conversion probability per age group, drawn from a common Beta
    p = pm.Beta("p", alpha=alpha_hyper, beta=beta_hyper, dims="age_group")

    conversion_rate = pm.Bernoulli("conversion_rate", p=p[age_group_idx], observed=df_conversions["conversion"].values)

    hierarchical_idata_conversions = pm.sample(1000, return_inferencedata=True)

Non-hierarchical model:

with pm.Model(coords=COORDS) as nonhierarchical_conversion_model:
    # Fixed Beta(2, 2) prior for each group: no pooling across groups
    p = pm.Beta("p", alpha=2, beta=2, dims="age_group")

    conversion_rate = pm.Bernoulli("conversion_rate", p=p[age_group_idx], observed=df_conversions["conversion"].values)

    nonhierarchical_idata_conversions = pm.sample(1000, return_inferencedata=True)
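To put the two posteriors side by side, something like this ArviZ forest plot can be used (not part of the original post; it assumes ArviZ is installed):

import arviz as az

# Side-by-side posteriors of p for both models
az.plot_forest(
    [hierarchical_idata_conversions, nonhierarchical_idata_conversions],
    model_names=["hierarchical", "non-hierarchical"],
    var_names=["p"],
    combined=True,
)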

According to what I read about shrinkage, I would have expected the estimated conversion rate of the smallest group to be pulled toward the other groups. However, the effect is the opposite:

Hierarchical model: (posterior plot of p per age group not shown)

Non-hierarchical model: (posterior plot of p per age group not shown)

Obviously, the hierarchical model recovers the observed conversion rate of the 70+ age group much more closely, but given the small amount of data, it is not clear to me whether this is good or bad. I would have expected the hierarchical model to pull the 70+ estimate toward the global conversion rate (shrinkage), but this is clearly not the case. Why is that?

Best regards and a happy Christmas!
Matthias

I have thought about this issue for a while now and want to add some additional insights that might explain it. I would be interested in your thoughts on my reasoning.

By using Gamma hyper priors for the alpha and beta parameters of the Beta prior, I probably taught the model to learn the common shape of the distributions of p, but not so much the actual value range (especially the center of mass). Therefore, there can probably be no shrinkage with regard to the center of mass.
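A quick check supports this reasoning (my own sketch, not part of the model above): with independent Gamma(2, 2) hyper priors, the implied prior mean alpha / (alpha + beta) follows a Beta(2, 2) distribution, i.e. it is spread broadly over (0, 1), so the hyper prior says very little about where the group-level p’s are centered.

import numpy as np

rng = np.random.default_rng(0)

# Gamma(alpha=2, beta=2) in PyMC has rate 2, i.e. scale 1/2
alpha = rng.gamma(shape=2, scale=0.5, size=10_000)
beta = rng.gamma(shape=2, scale=0.5, size=10_000)

implied_mean = alpha / (alpha + beta)  # distributed as Beta(2, 2)
print(np.quantile(implied_mean, [0.05, 0.5, 0.95]))  # approx. [0.14, 0.5, 0.86]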

This can be changed by parameterizing the Beta prior with mu and sigma instead of alpha and beta. Then shrinkage toward the global estimate (hyper prior) of mu should be observable. It would look like this:

age_group_idx, age_group_unique = pd.factorize(df_conversions["age_group"], sort=True)

COORDS = {"age_group": ["0-49", "50-69", "70+"]}

with pm.Model(coords=COORDS) as hierarchical_conversion_model:
    # Hyper priors on the mean and spread of the group-level p's
    mu = pm.Uniform("mu", lower=0.0, upper=0.5)
    sigma = pm.HalfNormal("sigma", sigma=0.1)

    # Beta prior parameterized by mean and standard deviation
    p = pm.Beta("p", mu=mu, sigma=sigma, dims="age_group")

    conversion_rate = pm.Bernoulli("conversion_rate", p=p[age_group_idx], observed=df_conversions["conversion"].values)

    hierarchical_idata_conversions = pm.sample(1000, tune=1000, return_inferencedata=True, target_accept=0.99)

The first thing to note is that this model has more problems with divergences, which is why I raised target_accept.
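As a side note, the number of divergent transitions can be read off the trace like this (assuming the InferenceData object from above):

# Count divergent transitions across all chains
n_div = int(hierarchical_idata_conversions.sample_stats["diverging"].sum())
print(f"divergent transitions: {n_div}")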

Second, there is now notable shrinkage among the three estimated p distributions (plot not shown).

This is now much more in line with what I expected. If my reasoning is correct, I have learned an important lesson about the use of hyper priors: in this case, I have to decide what should be “common” to all p estimates, the shape (by estimating alpha and beta globally) or the mean (by estimating mu globally).

Best regards
Matthias


This worked example from Statistical Rethinking is great for illustrating how hierarchical priors work. There’s a similar but more complex example using a multivariate normal prior in this lecture as well.

The divergences in your last example can probably be eliminated by working with logit_p instead of p and using normal distributions for mu and logit_p (sigma of course still needs a strictly positive distribution).
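For illustration, a sketch of such a logit-scale model could look like this (the hyper prior widths and the non-centered parameterization are assumptions on my part, not from the original post):

with pm.Model(coords=COORDS) as logit_conversion_model:
    mu = pm.Normal("mu", mu=0.0, sigma=1.5)
    sigma = pm.HalfNormal("sigma", sigma=1.0)

    # Non-centered parameterization, which often also reduces divergences
    z = pm.Normal("z", mu=0.0, sigma=1.0, dims="age_group")
    logit_p = pm.Deterministic("logit_p", mu + z * sigma, dims="age_group")

    conversion_rate = pm.Bernoulli(
        "conversion_rate",
        logit_p=logit_p[age_group_idx],
        observed=df_conversions["conversion"].values,
    )

    logit_idata = pm.sample(1000, tune=1000)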
