Hello - I’m really enjoying the library and hoping someone can clarify some conceptual things for me. I’ve watched 9 hours of SciPy presentations and gone through multiple tutorial notebooks, and I’m still unclear on a few things.
I’m building a model very similar to the Partial Pooling Model demonstrated here.
This is the snippet I’m interested in:
with pm.Model(coords=coords) as partial_pooling:
    county_idx = pm.Data("county_idx", county, dims="obs_id")
    # Hyperpriors:
    a = pm.Normal("a", mu=0.0, sigma=10.0)
    sigma_a = pm.Exponential("sigma_a", 1.0)
    # Varying intercepts:
    a_county = pm.Normal("a_county", mu=a, sigma=sigma_a, dims="County")
    # Expected value per county:
    theta = a_county[county_idx]
    # Model error:
    sigma = pm.Exponential("sigma", 1.0)
    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")
pm.model_to_graphviz(partial_pooling)
What I don’t understand is why theta is called the “expected value per county”. Why isn’t the likelihood (y) what you sample when you want the expected distribution for each group in the hierarchy?
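To make the question concrete, here is roughly how I would pull each quantity out of that example (a sketch; trace and ppc are hypothetical names for that model’s pm.sample() and pm.sample_posterior_predictive() outputs):

import arviz as az

# Sketch: "trace" and "ppc" are hypothetical sampling outputs of the radon model above
idata = az.from_pymc3(trace=trace, posterior_predictive=ppc, model=partial_pooling)

# theta is just a_county indexed per observation, so the per-county
# expected values live in a_county, with dims (chain, draw, County)
a_county_draws = idata.posterior["a_county"]

# The likelihood's posterior predictive has dims (chain, draw, obs_id):
# one simulated y per observation, not one per county
y_draws = idata.posterior_predictive["y"]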
I’m trying to do something very similar with a hierarchical model where I estimate the distribution of each group in a single model, but I’m not sure I’m comparing the distributions from PyMC3 to my original data correctly.
In my code I convert my trace “theta” (via az.from_pymc3()), which is a (4000, n_groups)-shaped array, into sampled distributions for each group and plot them against the original data. But I can also convert my likelihood (y) into an array of shape (4000, size_of_original_data) and plot that against the original data, and both look reasonable. I’m 99% sure that if I want to compare my univariate original data distributions to the sampled expected distributions I should use “theta”, as shown in all the tutorials, but I don’t understand why. ArviZ’s built-in plots are helpful, but I need to understand the raw sampled data for added flexibility. I’ll add my code below; thanks for any insights or reference material.
coords = {'groups': data['hierarchy'].unique(),
          'obs_id': np.arange(len(data))}
with pm.Model(coords=coords) as model_partial_pooling:
    # Hyperpriors for group nodes
    alpha_a = pm.HalfNormal('alpha_a', sigma=observed_mean)
    beta_a = pm.HalfNormal('beta_a', sigma=1.)
    a = pm.Gamma('a', alpha=alpha_a, beta=beta_a, dims='groups')

    # Expected value per group
    ac_est = a[groups_idx]

    # Model error
    eps = pm.HalfCauchy('eps', beta=1.)

    # Data likelihood
    ac_dists = pm.Normal('ac_dists',
                         mu=ac_est,
                         sigma=eps,
                         observed=data['scaled_target'],
                         dims='obs_id')

    prior_partial_pooling = pm.sample_prior_predictive()
# Inference button (TM)!
with model_partial_pooling:
    trace_partial_pooling = pm.sample(draws=2000,
                                      tune=2000,
                                      cores=1,
                                      # return_inferencedata=True,
                                      init='adapt_diag')
with model_partial_pooling:
    ppc_partial_pooling = pm.sample_posterior_predictive(
        trace_partial_pooling,
        var_names=['alpha_a', 'a', 'eps', 'ac_dists'])
data_partial_pooling = az.from_pymc3(trace=trace_partial_pooling,
                                     prior=prior_partial_pooling,
                                     posterior_predictive=ppc_partial_pooling,
                                     model=model_partial_pooling)
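In terms of raw arrays, these are the two things I end up comparing (a sketch; the stacking just flattens 2 chains × 2000 draws into the 4000 samples I mentioned above):

# Group-level expected values: shape (n_chains * n_draws, n_groups) after stacking
a_samples = (data_partial_pooling.posterior['a']
             .stack(sample=('chain', 'draw'))
             .transpose('sample', 'groups')
             .values)

# Posterior predictive of the likelihood: shape (n_chains * n_draws, len(data))
y_samples = (data_partial_pooling.posterior_predictive['ac_dists']
             .stack(sample=('chain', 'draw'))
             .transpose('sample', 'obs_id')
             .values)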
### This is the dataframe where I compare each group to the original group data ###
# One column per group, one row per posterior sample of that group's mean
ac_ppc_partial_pooling = (data_partial_pooling.posterior_predictive['a']
                          .sel(chain=0)  # a single chain is enough for the comparison
                          .to_dataframe()
                          .reset_index())
ac_ppc_partial_pooling = ac_ppc_partial_pooling.pivot(index='draw',
                                                      columns='groups',
                                                      values='a')
ac_ppc_partial_pooling.head()
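For reference, the plot below comes from something like this (a sketch; original_by_group is a hypothetical dict mapping each group name to its slice of the original data):

import matplotlib.pyplot as plt

# Sketch: one density histogram per group, sampled vs. original.
# "original_by_group" is hypothetical, e.g.
# {g: data.loc[data['hierarchy'] == g, 'scaled_target'] for g in coords['groups']}
fig, axes = plt.subplots(nrows=len(ac_ppc_partial_pooling.columns), figsize=(6, 18))
for ax, group in zip(axes, ac_ppc_partial_pooling.columns):
    ax.hist(ac_ppc_partial_pooling[group], bins=50, density=True,
            alpha=0.5, label='sampled')
    ax.hist(original_by_group[group], bins=50, density=True,
            alpha=0.5, label='original')
    ax.set_title(group)
    ax.legend()
plt.tight_layout()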
Plot of my “ac_ppc_partial_pooling” DataFrame against the original data for each group:
The poor fit is another problem, but first I want to make sure I’m even comparing the right things.