How to make inferences on new data with a hierarchical model?

Hi, I can’t figure out how to make inferences about latent parameters on new data with a hierarchical model.

First I simulated some data:

import pandas as pd
import pymc as pm

MU0 = 100
SIGMA0 = 10
MU1 = 5

NUM_USERS = 18
NUM_TRAIN_USERS = 10
MEAN_NUM_OBS = 5

def expand_list_of_list(lol):
    return [x for y in lol for x in y]

user_mean = pm.draw(pm.Normal.dist(MU0, SIGMA0), draws=NUM_USERS, random_seed=1)
user_sd = pm.draw(pm.HalfNormal.dist(MU1), draws=NUM_USERS, random_seed=1)
num_obs = pm.draw(pm.Poisson.dist(MEAN_NUM_OBS), draws=NUM_USERS, random_seed=1) + 1

sim_pd = (
    pd.DataFrame({
        'user_id': expand_list_of_list([[i] * x for i, x in enumerate(num_obs)]),
        'mean': expand_list_of_list([[x] * y for x, y in zip(user_mean, num_obs)]),
        'sigma': expand_list_of_list([[x] * y for x, y in zip(user_sd, num_obs)]),
        'value': expand_list_of_list([pm.draw(pm.Normal.dist(x, y), draws=z, random_seed=1) for x, y, z in zip(user_mean, user_sd, num_obs)]),
    })
)

train_pd = sim_pd.query(f"user_id < {NUM_TRAIN_USERS}")
test_pd = sim_pd.query(f"user_id >= {NUM_TRAIN_USERS}").reset_index(drop=True)
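(As a side note, the expand_list_of_list helper above is just a one-level flatten; a quick check of what it produces for the user_id column, pure Python, no PyMC needed:)

```python
def expand_list_of_list(lol):
    """Flatten one level of nesting: [[a, b], [c]] -> [a, b, c]."""
    return [x for y in lol for x in y]

# Repeat each user's id once per observation, as in the user_id column above
num_obs = [2, 3, 1]
user_ids = expand_list_of_list([[i] * n for i, n in enumerate(num_obs)])
print(user_ids)  # [0, 0, 1, 1, 1, 2]
```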

Then I constructed a hierarchical model like this using the training data:

with pm.Model(coords={
    'user_id': sorted(train_pd['user_id'].unique()),
    'user_id_list': train_pd['user_id'],
}) as m:
    user_mu = pm.Normal('user_mu', MU0, dims='user_id')
    user_sigma = pm.HalfNormal('user_sigma', SIGMA0, dims='user_id')
    value = pm.Data('value', train_pd['value'], dims='user_id_list')
    y = pm.Normal(
        'y',
        mu=user_mu[train_pd['user_id']],
        sigma=user_sigma[train_pd['user_id']],
        observed=value,
        dims='user_id_list',
    )
    idata = pm.sample()

That looked okay, but when I try to run inference on the new data like this:

with m:
    pm.set_data(
        {'value': test_pd['value']},
        coords={'user_id': sorted(test_pd['user_id'].unique()),
                'user_id_list': test_pd['user_id']},
    )
    idata.extend(pm.sample_posterior_predictive(idata))

I get this error:

ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (45,) and arg 1 with shape (65,).
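(Editor's note: the two shapes are presumably the test-length and train-length arrays failing to broadcast, since set_data swapped in the test values while mu/sigma are still indexed by the training rows. A minimal NumPy reproduction of the same kind of error, under that assumption:)

```python
import numpy as np

mu = np.zeros(65)     # e.g. mu still indexed by the 65 training rows
value = np.zeros(45)  # e.g. the 45 newly set test values
try:
    value + mu
except ValueError as e:
    print(e)  # operands could not be broadcast together
```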

Can I get some help on how I can fix this? Thank you.

Have you tried the suggestion here of also resetting the dimensions of the model?


So I have not been able to get this to work, but I believe this is on the right track. The issue is that you are trying to make inferences on groups outside of the model. In this case, I think you need to define a second model and then call sample_posterior_predictive on it to make predictions for these unseen groups. Following this blog post and this previous question, I think it is something like this:

with pm.Model(coords={'obs_idx': train_pd.reset_index().index.tolist(),
                      'user_id': sorted(train_pd['user_id'].unique())}) as m:
    user_mu = pm.Normal('user_mu', mu=MU0, sigma=SIGMA0, dims='user_id')
    user_sigma = pm.HalfNormal('user_sigma', MU1, dims='user_id')
    y = pm.Normal(
        'y',
        mu=user_mu[train_pd['user_id'].to_numpy()],
        sigma=user_sigma[train_pd['user_id'].to_numpy()],
        observed=train_pd['value'],
        dims='obs_idx',
    )
    idata = pm.sample()

A few things:

  1. You were defining the value variable as both mutable and observed. If you want something mutable, it should be the column you use for indexing, since that is what changes between datasets. Once value lives in a data container, the model expects it for any future inference, so even when predicting on unseen data you would still have to pass something into that container. I believe that if you set predictions=True it does not matter, but I think it is best to just leave it outside of a container.
  2. The dimensions of y should match the number of rows in your data frame. If I understand correctly you want a mean for each group, but y should still live in the space of the observed data. The indexing by user id takes care of matching the mean/sigma of each user to the observed values.
  3. You were specifying train_pd['user_id'] as a coord, but you actually want it as an indexing variable, not a coord. Replacing it with obs_idx lets us specify the number of data points observed in y.
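The indexing in points 2 and 3 is plain NumPy "fancy indexing": it lifts one parameter per user up to one value per observed row. A NumPy-only picture:

```python
import numpy as np

user_mu = np.array([10.0, 20.0, 30.0])   # one mean per user  (dims: user_id)
user_idx = np.array([0, 0, 1, 2, 2, 2])  # one entry per row  (dims: obs_idx)
row_mu = user_mu[user_idx]               # one mean per observed row
print(row_mu)  # [10. 10. 20. 30. 30. 30.]
```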

I don’t think you can just use pm.set_data to make inferences on unseen groups. If I understand correctly, this is not an out-of-sample problem but an out-of-model problem. Thus, you would need to do something like the following, where you specify new coords and new values of mu/sigma but reuse the old idata.

new_coords = {'obs_idx': test_pd.index.tolist(),
              'user_id': sorted(test_pd['user_id'].unique())}

# Out-of-model prediction
with pm.Model(coords=new_coords) as m_oom:
    user_mu_oom = pm.Normal('user_mu', mu=MU0, sigma=SIGMA0, dims='user_id')
    user_sigma_oom = pm.HalfNormal('user_sigma', MU1, dims='user_id')
    # Map the raw test user ids onto positions 0..n_users-1 before indexing,
    # since user_mu_oom only has one entry per *test* user
    user_idx = pd.Categorical(test_pd['user_id'], categories=new_coords['user_id']).codes
    y = pm.Normal('y', mu=user_mu_oom[user_idx], sigma=user_sigma_oom[user_idx], dims='obs_idx')

Like I said, this is not working for me yet, but I’ll keep hacking at it. Maybe @ricardoV94 can weigh in since he provided the solution to that linked post.

I hope this helps some even though it’s not a full solution.


Thanks for the suggestion @iavicenna. In the original code I had already set the dims on the observed variable, but I guess maybe I need to also do it for the other variables? I will look at this some more.

@JAB thanks for the detailed response! Let me try your suggestions and I will report back.