I have a simple partially pooled model that fits a normal distribution to observed data. In my actual data I have hundreds of groups, with anywhere between one and several hundred observations per group. When I fit the model to the entire dataset I got major divergences and sampling issues. I then refit using only the groups that have some minimum number of observations, and this time sampling was very fast with very good results. I am wondering two things:
- Is this valid, or does it just bypass one of the strong points of Bayesian modeling, namely that I can still make estimates for groups with only a few observations? I sort of want the model to be biased towards the groups with many observations, so dropping the small groups would not negatively impact my purpose. I have several projects that have been sidelined by sampling issues caused by groups with very few observations; if this approach is valid, it would help me get those back on track and use the out-of-sample predictions for those unseen small-sample groups.
- How do I make out-of-model/out-of-sample predictions for new groups within the hierarchical structure while also taking into account the observations I do have for those groups? This last part is key, because I want to use the information I actually have for those new groups.
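To state that second question a bit more precisely (in my own notation, which may well be off): for a new group $k$ with a handful of observations $y_{k1}, \dots, y_{k n_k}$, I think what I am after is the posterior of that group's mean given both those observations and what was already learned about the hyperparameters from the fitted groups, something like

$$
p(\mu_k \mid y_k, y_{\text{fit}}) \;\propto\; \int p(y_k \mid \mu_k, \sigma_k)\, p(\mu_k \mid \mu_0, \sigma_\mu)\, p(\sigma_k \mid \sigma_0)\, p(\mu_0, \sigma_\mu, \sigma_0 \mid y_{\text{fit}})\; d\sigma_k \, d\mu_0 \, d\sigma_\mu \, d\sigma_0 ,
$$

where $\mu_0$, $\sigma_\mu$, and $\sigma_0$ correspond to mu_prior_mu, mu_prior_sigma, and sigma_prior in the model below, rather than a draw from the group-level prior alone.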
I have read this post, but I am still not clear. For one, the example with the schools only has one observation per group, so the new group's priors are just new means centered at that value. I guess I could take the mean of the few points I have for each new group, but I feel like I would be throwing away useful information.
I tried using data containers for the observed data and an index variable for the hierarchical structure, but I am not sure whether I then need to modify the coords or build an entirely new model that includes the new groups (I put a sketch of this attempt at the end of the post).
Here is a working example of what I want to do. The dataframe trim has only the groups with a minimum number of observations, and the dataframe remainder has (you guessed it) the remainder of the groups.
When modeling the new groups, I am also not sure whether the index variable needs to start over at 0 or continue counting so that it does not overlap with the groups in the original model.
import pymc as pm
import pymc.sampling.jax as pmjax
import pandas as pd
import numpy as np
import arviz as az
RANDOM_SEED = 98
rng = np.random.default_rng(RANDOM_SEED)
# Number of observations per group
n_obs = pm.draw(pm.DiscreteUniform.dist(lower=5, upper=100), 100, random_seed=rng)
# Random means
individual_means = pm.draw(pm.Normal.dist(mu=2, sigma=3), len(n_obs))
# Data
data = pd.DataFrame(
    {
        "data": np.concatenate([pm.draw(pm.Normal.dist(mu=mu, sigma=3), n) for mu, n in zip(individual_means, n_obs)]),
        "idx": np.concatenate([[ii] * n for ii, n in enumerate(n_obs)]),
        "n_obs": np.concatenate([[n] * n for n in n_obs]),
    }
)
# Trim data to remove groups with 20 or fewer observations
trim = data.query("n_obs > 20").copy(deep=True)
# Factorize idx after trimming
idx_factorized, idx_list = pd.factorize(trim.idx)
trim = trim.assign(idx_factorized = idx_factorized)
with pm.Model(coords = {"idx": idx_list}) as model:
# Multilevel mu
mu_prior_mu = pm.Normal("mu_prior_mu", 0, 5)
mu_prior_sigma = pm.HalfNormal("mu_prior_sigma", 3)
mu = pm.Normal("mu", mu_prior_mu, mu_prior_sigma, dims="idx")
# Multilevel sigma
sigma_prior = pm.HalfNormal("sigma_prior", 3)
sigma = pm.HalfNormal("sigma", sigma_prior, dims="idx")
# Observed data
y = pm.Normal('y', mu=mu[idx_factorized], sigma=sigma[idx_factorized], observed=trim.data.values)
# Sample
    idata = pmjax.sample_numpyro_nuts(2000, chains=4, idata_kwargs=dict(log_likelihood=True), random_seed=RANDOM_SEED)
# Crude check: compare the posterior means for mu against the empirical group means
MAE = np.mean(np.abs(trim.groupby("idx").data.mean().values - az.summary(idata, var_names="mu")["mean"].values)).round(5)
print(f"Mean absolute error: {MAE}")
# Remainder of data not used in the model
remainder = data.drop(trim.index)
# Factorize idx for the remainder groups
remainder_idx_factorized, remainder_idx_list = pd.factorize(remainder.idx)
# Sloppy but just continues counting
remainder_idx_factorized_continue_count = np.max(idx_factorized) + 1 + remainder_idx_factorized
# Assigning both factorized indices; not sure whether the new model should continue the group numbering or restart from 0
remainder = remainder.assign(idx_factorized = remainder_idx_factorized,
idx_factorized_continue_count = remainder_idx_factorized_continue_count)
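For reference, here is roughly the data-container version I mentioned above. This is just a sketch of what I tried, not something I know to be correct: dc_model, group_idx, and obs are names I made up, and I used pm.MutableData (it may be pm.Data depending on the PyMC version). The comments at the end mark exactly where I get stuck.

# Sketch of a data-container version of the same model (not sure this is the right pattern)
with pm.Model(coords={"idx": idx_list}) as dc_model:
    # Containers so the index and observed data can (hopefully) be swapped out later
    group_idx = pm.MutableData("group_idx", idx_factorized)
    obs = pm.MutableData("obs", trim.data.values)
    # Same multilevel structure as the model above
    mu_prior_mu = pm.Normal("mu_prior_mu", 0, 5)
    mu_prior_sigma = pm.HalfNormal("mu_prior_sigma", 3)
    mu = pm.Normal("mu", mu_prior_mu, mu_prior_sigma, dims="idx")
    sigma_prior = pm.HalfNormal("sigma_prior", 3)
    sigma = pm.HalfNormal("sigma", sigma_prior, dims="idx")
    # shape follows the container so the likelihood resizes with new data
    y = pm.Normal("y", mu=mu[group_idx], sigma=sigma[group_idx], observed=obs, shape=group_idx.shape)

# This is where I am stuck: the remainder groups are not in the "idx" coord, so indexing
# mu/sigma with their factorized indices (restarted or continued) either points at the
# wrong groups or falls outside the coord entirely. Do I call pm.set_data with modified
# coords, or do I need to build an entirely new model that includes the new groups?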
Thanks!