How to make inference on new data with a hierarchical model?

So I have not been able to get this to work, but I believe this is on the right track. The issue is that you are trying to make inferences on groups outside of the model. In this case, I think you need to actually define a second model and then call sample_posterior_predictive to make predictions on these unseen groups. Following this blog post and this previous question, I think it is something like this:

import pandas as pd
import pymc as pm

# Map raw user IDs to integer positions 0..n_users-1 (sort=True matches the sorted coord)
user_idx, user_labels = pd.factorize(train_pd['user_id'], sort=True)

with pm.Model(coords={'obs_idx': train_pd.reset_index().index.tolist(),
                      'user_id': user_labels.tolist()}) as m:
    user_mu    = pm.Normal('user_mu', mu=MU0, sigma=SIGMA0, dims=('user_id',))
    user_sigma = pm.HalfNormal('user_sigma', sigma=MU1, dims=('user_id',))
    y          = pm.Normal('y', mu=user_mu[user_idx], sigma=user_sigma[user_idx],
                           observed=train_pd['value'], dims=('obs_idx',))
    idata      = pm.sample()

A few things:

  1. You were defining the value variable as both mutable and observed. If you want something to be mutable, it should be the column you use for indexing, since that is what changes between training and prediction. With the value variable in a data container, the model expects that container to be filled for any future inference, so even when predicting values for unseen data you would still have to pass something into it. I believe that if you set predictions=True it does not matter, but I think it is best to just leave it outside of a container.
  2. The dimensions of y should be the number of rows in your data frame. If I understand correctly, you want a mean for each group, but y should still live in the space of the observed data. Indexing by user id takes care of matching each user's mean/sigma to the observed values.
  3. You were specifying train_pd['user_id'] as a coord, but you actually want it as an indexing variable, not a coord. By replacing it with obs_idx, we can specify the number of data points we observe in y.
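To make points 2 and 3 concrete, here is a minimal NumPy-only sketch of how indexing expands a per-user parameter to one value per row. The user IDs and the parameter values are made up for illustration, and user_mu_draw just stands in for a single draw of something like user_mu; none of this comes from your data.

```python
import numpy as np

# Hypothetical raw user IDs, one per observed row (not from the original post).
raw_user_ids = np.array([105, 42, 105, 7, 42, 42])

# np.unique returns the sorted unique labels (usable as the 'user_id' coord)
# and, with return_inverse=True, each row's position within those labels.
user_labels, user_idx = np.unique(raw_user_ids, return_inverse=True)

# Stand-in for one draw of a per-user parameter, ordered like user_labels.
user_mu_draw = np.array([0.5, -1.0, 2.0])

# Indexing expands the per-user vector to one value per observation,
# which is the shape y needs.
mu_per_row = user_mu_draw[user_idx]
```

The point is that the coord only needs the unique labels, while the integer positions do the row-to-user matching. Indexing with the raw IDs directly would only work if they already happened to be 0, 1, 2, ... with no gaps.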

I don’t think you can just use pm.set_data to make inferences on unseen groups. If I understand correctly, this is not an out-of-sample problem but more of an out-of-model problem: the new groups have no parameters in the fitted model at all. Thus, you would need to do something like the following, where you specify new coords and new mu/sigma variables for the unseen groups but reuse the old idata.

# Assumes pandas (pd) and pymc (pm) are already imported.
# Integer positions for the unseen users (sort=True matches the sorted coord)
new_user_idx, new_user_labels = pd.factorize(test_pd['user_id'], sort=True)
new_coords = {'obs_idx': test_pd.reset_index().index.tolist(),
              'user_id': new_user_labels.tolist()}

# Out-of-model prediction
with pm.Model(coords=new_coords) as m_oom:
    user_mu_oom    = pm.Normal('user_mu', mu=MU0, sigma=SIGMA0, dims=('user_id',))
    user_sigma_oom = pm.HalfNormal('user_sigma', sigma=MU1, dims=('user_id',))
    y = pm.Normal('y', mu=user_mu_oom[new_user_idx], sigma=user_sigma_oom[new_user_idx], dims=('obs_idx',))

Like I said, this is not working for me yet, but I’ll keep hacking at it. Maybe @ricardoV94 can weigh in since he provided the solution to that linked post.

I hope this helps some, even though it’s not a full solution.
