Hello there,
Here is my setup: I have repeated measurements on individuals (a set of $n_i$ visits at ages $t_{ij}$ for individual $i$, each measuring a univariate continuous outcome $y_{ij}$).
I want to model the outcome with a mixed-effects model, using a fully Bayesian approach. I start with a simple LMM: fixed and random intercepts and slopes with respect to individuals’ ages.
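In equations (matching the code below):

$$y_{ij} = a_i + b_i\,t_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \mathcal{N}\!\left(0,\ \tau^{-1}\right),$$
$$a_i \sim \mathcal{N}\!\left(a_0,\ \tau_a^{-1}\right), \qquad b_i \sim \mathcal{N}\!\left(b_0,\ \tau_b^{-1}\right),$$

with Normal hyperpriors on the fixed effects $a_0, b_0$ and Gamma hyperpriors on the precisions $\tau, \tau_a, \tau_b$.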
I split my dataset into 3 parts (a sketch of this split follows the list):
- a “train” part: all data of 80% of individuals
- on the remaining 20% of individuals:
  - a “test known” part: 2/3 of their visits
  - a “test to predict” part: the remaining 1/3 of their visits
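For reference, a minimal sketch of how such a split can be built (the boolean TRAIN / TEST_KNOWN index levels match the df.xs(...) calls in the code below; the split_dataset name, the TEST_PRED flag and the “first 2/3 of visits” rule are illustrative assumptions on my part):

```python
import numpy as np

def split_dataset(df, train_frac=0.8, known_frac=2/3, seed=0):
    """df is indexed by (individual, visit), with columns 'T' (age) and 'Y'."""
    rng = np.random.default_rng(seed)
    individuals = df.index.get_level_values(0).unique()
    train_inds = rng.choice(individuals, size=int(train_frac * len(individuals)),
                            replace=False)

    is_train = df.index.get_level_values(0).isin(train_inds)
    # flag the first `known_frac` of each hold-out individual's visits as known
    visit_rank = df.groupby(level=0).cumcount().to_numpy()
    n_visits = df.groupby(level=0)['T'].transform('size').to_numpy()
    is_known = ~is_train & (visit_rank < known_frac * n_visits)

    out = df.copy()
    out['TRAIN'], out['TEST_KNOWN'] = is_train, is_known
    out['TEST_PRED'] = ~is_train & ~is_known
    # individuals stay at index level 0, as expected by model_factory below
    return out.set_index(['TRAIN', 'TEST_KNOWN', 'TEST_PRED'], append=True)
```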
I’d like to:
- train my model on the “train” part, to estimate the posterior distribution of the fixed effects only
- then estimate the posterior distribution of the random effects of the hold-out individuals, given the “test known” data and the train data (marginalizing over the fixed effects according to the posterior obtained during training)
- finally, compute whatever model metric on the posterior predictive distribution of $y_{ij}$ on the “test to predict” part.
I understand that I could directly train my model on both the train and “test known” data (i.e. skip step 2; I tried it and it worked), but I don’t want to do that by design.
Before posting, I read these two posts, which ask similar questions:
- Predict on unseen group in hierarchical model
- How do we predict on new unseen groups in a hierarchical model in PyMC3?
To my understanding, they differ from my question: in the case of an unknown “group” (individual), those posts aim to sample from the population level of the hierarchy, which is not what I want. I want to estimate the hold-out random effects with the fixed effects locked, as one could do in a point-estimate approach.
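To make that analogy concrete, here is a minimal sketch of what step 2 would look like in a point-estimate world (the name map_random_effects and the scipy route are mine, purely illustrative): with the fixed effects and precisions frozen at their train estimates, fitting one hold-out individual’s random effects is just a penalized least-squares problem.

```python
import numpy as np
from scipy.optimize import minimize

def map_random_effects(t, y, a0, b0, tau, tau_a, tau_b):
    """MAP of (a_i, b_i) for one hold-out individual, fixed effects locked.

    t, y: ages and outcomes of that individual's "test known" visits.
    """
    def neg_log_post(theta):
        a_i, b_i = theta
        resid = y - (a_i + b_i * t)
        return (0.5 * tau * resid @ resid          # likelihood of known visits
                + 0.5 * tau_a * (a_i - a0) ** 2    # prior on random intercept
                + 0.5 * tau_b * (b_i - b0) ** 2)   # prior on random slope
    return minimize(neg_log_post, x0=np.array([a0, b0])).x
```

What I am after is exactly this, except with full posterior distributions everywhere instead of point estimates.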
Here’s my code:
import copy

import numpy as np
import pymc3 as pm

def model_factory(df):
    # <!> update levels (otherwise all levels from the main df are kept)
    df_ix = df.index.remove_unused_levels()
    coords = {"obs_id": np.arange(len(df)),
              "individuals": df_ix.levels[0]}
    with pm.Model(coords=coords) as lmm:
        # Data from the outside world (to be seen in the graphical model)
        inds_ix_v = pm.Data('individual_i', df_ix.codes[0], dims='obs_id')
        t_ij_v = pm.Data('age_ij', df['T'], dims='obs_id')

        # Fixed effects (model parameters); the hyper-parameters
        # (mu_a0, std_a0, mu_precision, ...) are constants defined elsewhere
        a0 = pm.Normal('a0', mu=mu_a0, sigma=std_a0)
        b0 = pm.Normal('b0', mu=mu_b0, sigma=std_b0)
        precision = pm.Gamma('tau', mu=mu_precision, sigma=var_precision**.5)
        precision_a = pm.Gamma('tau_a', mu=mu_precision_a, sigma=var_precision_a**.5)
        precision_b = pm.Gamma('tau_b', mu=mu_precision_b, sigma=var_precision_b**.5)

        # Random effects (individual parameters), orthogonal
        # Non-centered parametrization: https://twiecki.io/blog/2017/02/08/bayesian-hierchical-non-centered/
        a_offset = pm.Normal('ai_offset', mu=0, tau=1, dims="individuals")
        a = pm.Deterministic('a_i', a0 + a_offset / precision_a**.5, dims="individuals")
        b_offset = pm.Normal('bi_offset', mu=0, tau=1, dims="individuals")
        b = pm.Deterministic('b_i', b0 + b_offset / precision_b**.5, dims="individuals")

        # Model (likelihood) (a_i _||_ b_i _||_ e_ij)
        y_hat_ij = a[inds_ix_v] + b[inds_ix_v] * t_ij_v
        y = pm.Normal('y', mu=y_hat_ij, tau=precision, observed=df['Y'], dims="obs_id")

    return lmm
lmm_train = model_factory(df.xs(True, level='TRAIN'))

with lmm_train:
    train_trace = pm.sample(4000, tune=2000,  # 2k tuning (burn-in) + 4k samples
                            random_seed=seed, return_inferencedata=False)
Then I tried this for step 2, but it clearly does not work (the sampled random effects do not take the “test known” data into account; they just stay near the fixed-effects means):
# drop the train individuals' random effects from the trace,
# keeping only the fixed effects
train_trace_wo_ranef = copy.deepcopy(train_trace)
train_trace_wo_ranef.remove_values('ai_offset')
train_trace_wo_ranef.remove_values('bi_offset')
train_trace_wo_ranef.remove_values('a_i')
train_trace_wo_ranef.remove_values('b_i')

lmm_test_known = model_factory(df.xs(True, level='TEST_KNOWN'))
with lmm_test_known:
    #test_trace = pm.sample(trace=train_trace)
    post_samples = pm.sample_posterior_predictive(
        train_trace_wo_ranef, random_seed=seed,
        var_names=["ai_offset", "bi_offset", "a_i", "b_i"]
    )
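For completeness, the direction I would explore instead (a sketch only, not something I have validated): following the PyMC3 “updating priors” pattern, rebuild the model on the “test known” data with every fixed-effect prior replaced by (a KDE of) its train posterior, so that sampling only has to learn the hold-out offsets. The from_posterior helper name is mine.

```python
from scipy import stats

def from_posterior(name, samples):
    # turn 1-D posterior samples into an Interpolated prior
    # (pattern from the PyMC3 "updating priors" example notebook);
    # must be called inside a pm.Model context
    x = np.linspace(samples.min(), samples.max(), 200)
    return pm.Interpolated(name, x, stats.gaussian_kde(samples)(x))

# Inside a model_factory-like function built on the "test known" data,
# replacing the pm.Normal / pm.Gamma fixed-effect priors:
#     a0 = from_posterior('a0', train_trace['a0'])
#     b0 = from_posterior('b0', train_trace['b0'])
#     precision = from_posterior('tau', train_trace['tau'])
#     ... same for 'tau_a', 'tau_b' ...
# then pm.sample() as usual: only ai_offset / bi_offset are learned from
# scratch, while the fixed effects stay close to their train posterior.
```

Would this be a sound way to implement step 2, or is there a more idiomatic approach?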