Hierarchical Linear Regression - Simplest Possible Example

Hi all,

I’ve attempted to create the simplest possible example to illustrate hierarchical models for linear regression (even simpler than the Radon gas case study).

Problem Setup
The data consists of 2 univariate regressions (scalar regression coefficient, no intercept):

\begin{align}
x_{1} &\sim \text{Normal}(\mu = 0, \sigma^2 = 0.25) \\
x_{2} &\sim \text{Normal}(\mu = 0, \sigma^2 = 0.25) \\
\eta &\sim \text{Normal}(\mu = 0, \sigma^2 = 1) \\
\beta_1 &= 2, \hspace{2mm} \beta_2 = -5 \\
y_1 &= \beta_1 x_1 + \eta \\
y_2 &= \beta_2 x_2 + \eta
\end{align}

Then I concatenated y_1, y_2 and x_1, x_2 into single vectors, e.g. \mathbf{y} = [y_1^{(0)} \dots y_1^{(N_1)}, y_2^{(0)} \dots y_2^{(N_2)}], and added an indicator variable called site to differentiate the data from the 2 different linear models.

import numpy as np
import pandas as pd


def create_dataset() -> pd.DataFrame:
    """Assume there are 2 sites with different regression coefficients."""

    n = 20
    beta_opt_1 = 2.0
    beta_opt_2 = -5.0

    std_noise_x = 0.5
    std_noise_y = 1

    x_1 = std_noise_x * np.random.randn(n)
    y_1 = beta_opt_1 * x_1 + std_noise_y * np.random.randn(n)

    x_2 = std_noise_x * np.random.randn(n)
    y_2 = beta_opt_2 * x_2 + std_noise_y * np.random.randn(n)

    df_1 = pd.DataFrame({"y": y_1, "x": x_1}).assign(
        **{"site": "site_A", "site_idx": 0}
    )
    df_2 = pd.DataFrame({"y": y_2, "x": x_2}).assign(
        **{"site": "site_B", "site_idx": 1}
    )

    df = pd.concat([df_1, df_2], axis=0, ignore_index=True)

    return df


data = create_dataset()
n_sites = len(data["site"].unique())
site_indices = data["site_idx"].values

Then, I performed inference using 3 different schemes (similar to the Radon case study):

  • Pooled
  • Unpooled (individual regressions for y_1 and y_2)
  • Hierarchical regression

Pooled

import pymc3 as pm  # PyMC3-style API (sd= keyword, MultiTrace.get_values)

with pm.Model() as pooled_model:

    # Define priors
    sigma_pooled = pm.HalfCauchy("sigma_pooled", beta=5)
    beta_pooled = pm.Normal("beta_pooled", mu=0, sd=20)

    # Define likelihood
    likelihood = pm.Normal(
        "y_hat", mu=beta_pooled * data["x"], sd=sigma_pooled, observed=data["y"]
    )

    # Inference!
    trace_pooled = pm.sample(draws=3000)
    
    # Getting one chain
    posterior_beta_pooled = trace_pooled.get_values("beta_pooled", burn=1000, chains=[0])

Unpooled

with pm.Model() as individual_model:

    # Define priors, now with one set per site (shape=n_sites)
    sigma_individual = pm.HalfCauchy("sigma_individual", beta=5, shape=n_sites)
    beta_individual = pm.Normal("beta_individual", mu=0, sd=20, shape=n_sites)

    # Define likelihood
    likelihood = pm.Normal(
        "y_individual",
        mu=beta_individual[site_indices] * data["x"],
        sd=sigma_individual[site_indices],
        observed=data["y"],
    )

    # Inference!
    trace_individual = pm.sample(draws=3000)
    
    posterior_beta_individual = trace_individual.get_values(
        "beta_individual", burn=1000, chains=[0]
    )

Hierarchical

with pm.Model() as hierarchical_model:

    sigma_hierarchical = pm.HalfCauchy("sigma_hierarchical", beta=5, shape=n_sites)

    # The step that makes it hierarchical:
    # only beta is linked to a global (group-level) distribution; sigma stays per-site
    mu_global = pm.Normal("mu_global", mu=0, sd=0.1)
    sd_global = pm.HalfCauchy("sd_global", beta=0.1)
    beta_hierarchical = pm.Normal(
        "beta_hierarchical", mu=mu_global, sd=sd_global, shape=n_sites
    )

    # Define likelihood
    likelihood = pm.Normal(
        "y_hierarchical",
        mu=beta_hierarchical[site_indices] * data["x"],
        sigma=sigma_hierarchical[site_indices],
        observed=data["y"],
    )

    # Inference!
    trace_hierarchical = pm.sample(draws=3000)
    
    posterior_beta_hierarchical = trace_hierarchical.get_values(
        "beta_hierarchical", burn=1000, chains=[1]
    )

The goal of the exercise was to show that the hierarchical estimates sit somewhere in between the pooled and individual estimates, and that this difference is affected by a few things (a quick comparison sketch follows the list):

  1. The confidence in the hyper-prior
  2. The amount of data available for the different “sites” (i.e. confidence in the data)
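
To make the comparison concrete, here is a minimal sketch (assuming the posterior_beta_pooled, posterior_beta_individual and posterior_beta_hierarchical arrays computed above, with each per-site trace of shape (n_samples, n_sites)) that prints the posterior mean of beta for each site under each scheme:

import numpy as np

# posterior_beta_pooled has shape (n_samples,);
# the unpooled and hierarchical traces have shape (n_samples, n_sites)
summary = {
    "pooled": [np.mean(posterior_beta_pooled)] * n_sites,
    "unpooled": np.mean(posterior_beta_individual, axis=0),
    "hierarchical": np.mean(posterior_beta_hierarchical, axis=0),
}

for scheme, means in summary.items():
    for site_idx, mean in enumerate(means):
        print(f"{scheme:>12} | site {site_idx} | posterior mean beta = {mean:+.2f}")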

But I find that the hierarchical model always yields posteriors closer to the individual estimates than to the pooled estimate. In other words, it looks like the hierarchical formulation doesn’t change the results compared to the individual estimates.

  1. Can I formulate the problem better to illustrate the mechanics of hierarchical modelling? (One possible reformulation is sketched below.)
  2. Is there a different set of hyper-parameters needed to illustrate what I want to show?
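
Regarding question 1, one reformulation that might make the partial pooling easier to see is to give one site far fewer observations, so its unpooled estimate is noisy and has more room to be pulled toward the group mean. A minimal sketch (the create_unbalanced_dataset helper and its sample sizes are hypothetical, but it uses the same generative process as create_dataset above):

import numpy as np
import pandas as pd


def create_unbalanced_dataset(n_a: int = 50, n_b: int = 3) -> pd.DataFrame:
    """Same generative process as create_dataset, but site B gets far fewer points."""
    beta_opt_1, beta_opt_2 = 2.0, -5.0
    std_noise_x, std_noise_y = 0.5, 1.0

    x_1 = std_noise_x * np.random.randn(n_a)
    y_1 = beta_opt_1 * x_1 + std_noise_y * np.random.randn(n_a)

    x_2 = std_noise_x * np.random.randn(n_b)
    y_2 = beta_opt_2 * x_2 + std_noise_y * np.random.randn(n_b)

    df_1 = pd.DataFrame({"y": y_1, "x": x_1}).assign(site="site_A", site_idx=0)
    df_2 = pd.DataFrame({"y": y_2, "x": x_2}).assign(site="site_B", site_idx=1)

    # With only a handful of points for site B, its likelihood is weak and the
    # hierarchical posterior for its beta should lean more on the group-level prior.
    return pd.concat([df_1, df_2], axis=0, ignore_index=True)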

Hi,

I read everything rather quickly, so please tell me if I missed something. You are using a dataset with only two groups and two parameters to estimate it. If you want to pool towards the mean, I would suggest that you only specify a global mean with a larger sd. You can see that in this way you achieve the pooled effect.

with pm.Model() as hierarchical_model:

    sigma_hierarchical = pm.HalfCauchy("sigma_hierarchical", beta=5, shape=n_sites)

    # The step that makes it hierarchical:
    # only a global mean is specified; the group-level sd is fixed below
    mu_global = pm.Normal("mu_global", mu=0, sd=5)
    # sd_global = pm.HalfCauchy("sd_global", beta=0.1)
    beta_hierarchical = pm.Normal(
        "beta_hierarchical", mu=mu_global, sd=0.5, shape=n_sites
    )

    # Define likelihood
    likelihood = pm.Normal(
        "y_hierarchical",
        mu=beta_hierarchical[site_indices] * data["x"],
        sigma=sigma_hierarchical[site_indices],
        observed=data["y"],
    )

    # Inference!
    trace_hierarchical = pm.sample(draws=3000)
    
    posterior_beta_hierarchical = trace_hierarchical.get_values(
        "beta_hierarchical", burn=1000, chains=[1]
    )



It worked!

As you suggested, the trick was not to have a global standard deviation for the coefficients, only a global mean.

I haven’t thought about it properly yet, but what’s the reason behind this?

Is it that by also trying to learn the uncertainty in the coefficients, you actually end up learning the individual estimates?

In the Radon gas example, they do try to learn the uncertainty in the coefficients, but it seems to “work”.
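
One way to probe this (a sketch, assuming the trace_hierarchical from the original model with the learned sd_global is still in scope): look at the posterior of sd_global. A plausible explanation is that with only two sites whose true coefficients are far apart (2 and -5), the heavy-tailed HalfCauchy hyper-prior can let sd_global grow large, and a large group-level sd means the global mean barely constrains the per-site betas, so they stay close to the unpooled estimates.

import numpy as np

# Posterior of the learned group-level sd from the original hierarchical model
sd_global_samples = trace_hierarchical.get_values("sd_global", burn=1000)

print("posterior mean of sd_global:", np.mean(sd_global_samples))
print("posterior 95% interval:", np.percentile(sd_global_samples, [2.5, 97.5]))
# If these values are large relative to the spread of the betas,
# the group-level prior provides almost no shrinkage.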
