Hi all,
I've attempted to create the simplest possible example illustrating hierarchical models for linear regression (even simpler than the Radon Gas case study).

Problem Setup

The data consists of 2 univariate regressions (scalar regression coefficient, no intercept).
I then concatenated y_1, y_2 and x_1, x_2 into single vectors, e.g. \mathbf{y} = [y_1^{(0)} \dots y_1^{(N_1)}, y_2^{(0)} \dots y_2^{(N_2)}], and added an indicator variable called site
to differentiate the data from the 2 different linear models.
```python
import numpy as np
import pandas as pd
import pymc3 as pm


def create_dataset() -> pd.DataFrame:
    """
    Assume there are 2 sites with different regression coefficients.
    """
    n = 20
    beta_opt_1 = 2.0
    beta_opt_2 = -5.0
    std_noise_x = 0.5
    std_noise_y = 1
    x_1 = std_noise_x * np.random.randn(n)
    y_1 = beta_opt_1 * x_1 + std_noise_y * np.random.randn(n)
    x_2 = std_noise_x * np.random.randn(n)
    y_2 = beta_opt_2 * x_2 + std_noise_y * np.random.randn(n)
    df_1 = pd.DataFrame({"y": y_1, "x": x_1}).assign(site="site_A", site_idx=0)
    df_2 = pd.DataFrame({"y": y_2, "x": x_2}).assign(site="site_B", site_idx=1)
    df = pd.concat([df_1, df_2], axis="index")
    return df


data = create_dataset()
n_sites = len(data["site"].unique())
site_indices = data["site_idx"].values
```
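The long-format indexing trick above (one coefficient per site, looked up via `site_idx`) can be sanity-checked in plain NumPy; the values here are illustrative:

```python
import numpy as np

# Two per-site coefficients and a toy concatenated (long-format) dataset
beta = np.array([2.0, -5.0])       # coefficient for site 0 and site 1
site_idx = np.array([0, 0, 1, 1])  # which site each row belongs to
x = np.array([1.0, 2.0, 1.0, 2.0])

# Fancy indexing broadcasts the right coefficient to each row,
# which is exactly what beta_individual[site_indices] * data["x"]
# does inside the models below
mu = beta[site_idx] * x  # [2.0, 4.0, -5.0, -10.0]
```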
Then, I performed inference using 3 different schemes (similar to the Radon data model):
- Pooled
- Unpooled (individual regressions for y_1 and y_2)
- Hierarchical regression

Pooled
```python
with pm.Model() as pooled_model:
    # Define priors
    sigma_pooled = pm.HalfCauchy("sigma_pooled", beta=5)
    beta_pooled = pm.Normal("beta_pooled", mu=0, sd=20)
    # Define likelihood
    likelihood = pm.Normal(
        "y_hat", mu=beta_pooled * data["x"], sd=sigma_pooled, observed=data["y"]
    )
    # Inference!
    trace_pooled = pm.sample(draws=3000)

# Getting one chain
posterior_beta_pooled = trace_pooled.get_values("beta_pooled", burn=1000, chains=[0])
```
Unpooled
```python
with pm.Model() as individual_model:
    # Define priors, one per site
    sigma_individual = pm.HalfCauchy("sigma_individual", beta=5, shape=n_sites)
    beta_individual = pm.Normal("beta_individual", mu=0, sd=20, shape=n_sites)
    # Define likelihood
    likelihood = pm.Normal(
        "y_individual",
        mu=beta_individual[site_indices] * data["x"],
        sd=sigma_individual[site_indices],
        observed=data["y"],
    )
    # Inference!
    trace_individual = pm.sample(draws=3000)

posterior_beta_individual = trace_individual.get_values(
    "beta_individual", burn=1000, chains=[0]
)
```
Hierarchical
```python
with pm.Model() as hierarchical_model:
    sigma_hierarchical = pm.HalfCauchy("sigma_hierarchical", beta=5, shape=n_sites)
    # The step that makes it hierarchical:
    # only beta is linked to the global hyper-prior
    mu_global = pm.Normal("mu_global", mu=0, sd=0.1)
    sd_global = pm.HalfCauchy("sd_global", beta=0.1)
    beta_hierarchical = pm.Normal(
        "beta_hierarchical", mu=mu_global, sd=sd_global, shape=n_sites
    )
    # Define likelihood
    likelihood = pm.Normal(
        "y_hierarchical",
        mu=beta_hierarchical[site_indices] * data["x"],
        sd=sigma_hierarchical[site_indices],
        observed=data["y"],
    )
    # Inference!
    trace_hierarchical = pm.sample(draws=3000)

posterior_beta_hierarchical = trace_hierarchical.get_values(
    "beta_hierarchical", burn=1000, chains=[0]
)
```
The goal of the exercise was to show that the hierarchical estimate sits somewhere in between the pooled and individual estimates, and that this difference is affected by a few things:
- The confidence in the hyper-prior
- The amount of data available for the different "sites" (i.e. confidence in the data)

However, I find that the hierarchical model always yields posteriors closer to the individual estimates than to the pooled estimate. In other words, it looks like using a hierarchical formulation doesn't change the results compared to the individual estimates.
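For intuition on where the hierarchical estimate should land, here is a hand-rolled sketch (not part of the PyMC model) of the conjugate normal-normal partial-pooling formula for a single site's slope: the conditional posterior mean is a precision-weighted average of the site's own OLS estimate and the hyper-prior's global mean, so the estimate only moves noticeably toward the pool when the hyper-prior is tight (small tau_global) relative to the data precision. The numbers below are illustrative, chosen to match the dataset above:

```python
def partial_pool(beta_ols, sum_x_sq, sigma_noise, mu_global, tau_global):
    """Conditional posterior mean of one site's slope (conjugate normal case):
    a precision-weighted average of the site's OLS estimate and the global mean."""
    prec_data = sum_x_sq / sigma_noise**2  # confidence in the site's own data
    prec_prior = 1.0 / tau_global**2       # confidence in the hyper-prior
    return (prec_data * beta_ols + prec_prior * mu_global) / (prec_data + prec_prior)

# Site with OLS estimate 2.0 and global mean 0.0; with n = 20 and
# std_noise_x = 0.5, sum(x**2) is about 20 * 0.5**2 = 5.0 on average
loose = partial_pool(2.0, sum_x_sq=5.0, sigma_noise=1.0, mu_global=0.0, tau_global=10.0)
tight = partial_pool(2.0, sum_x_sq=5.0, sigma_noise=1.0, mu_global=0.0, tau_global=0.1)
# loose ~ 1.996 (barely shrunk), tight ~ 0.095 (pulled hard toward the pool)
```

This suggests that with only n = 20 points per site the data precision is still large enough to dominate any moderately wide hyper-prior, which may be why the hierarchical posteriors track the individual estimates so closely.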
- Can I formulate the problem better to illustrate the mechanics of hierarchical modelling?
- Is there a different set of hyper-parameters needed to illustrate what I want to show?