Out-of-sample/out-of-model predictions in a hierarchical model with new observed data

Hi all,

I know it's been a few weeks, but I wanted to follow up on this thread. I was able to spend some time on a few variations of the problem. I reused some of @Benjamin's code to write an example with a few variations of out-of-model prediction.

Just to recap: I have done out-of-sample prediction in a hierarchical model before by using pm.Data in the model and then pm.set_data to swap the data and make predictions. However, (I believe) this only works when the group index already exists within the hierarchical structure (i.e., if I originally have 10 groups, I can't ask for a 14th group that never appeared in the original data). Here are some things I tried and some observations; I would love to hear whether this is the correct interpretation of what I observed. The basic model looks like this:

    import pymc as pm

    coords = {"idx": frame["idx"].unique()}  # group labels, assumed to be integer codes 0..K-1

    with pm.Model(coords=coords) as model:
        # Data
        g = pm.Data("g", frame["idx"].to_numpy())
        y = pm.Data("y", frame["data"].to_numpy())

        # Population mean
        mu = pm.Normal("mu", 0., 5.)
        # Across-group variability of the mean
        tau = pm.HalfNormal("tau", 3.)
        # Population standard deviation
        sigma = pm.HalfNormal("sigma", 3.)

        # Group-specific mean (non-centered parameterization)
        eps_mu_g = pm.Normal("eps_mu_g", dims="idx")
        mu_g = pm.Deterministic("mu_g", mu + tau * eps_mu_g, dims="idx")

        # Observed data
        pm.Normal("y_obs", mu=mu_g[g], sigma=sigma, observed=y)
  1. mu_g and eps_mu_g have the shape of the number of groups. So in my first attempt, I only used pm.Flat for the params that are not group specific: mu, tau, and sigma. When sampling the posterior predictive, I found that the distribution was the same for all groups, since no group-specific information was being used. It confused me at first, but it makes perfect sense now: I thought that maybe some information would leak in from the backend, but I realized I never told PyMC where to look. My non-centered param eps_mu_g_new did not have the same name, so the variable did not match anything in the original inference data (see the first sketch after this list).

  2. In my second attempt I still used pm.Flat for mu, sigma, and tau, but this time I explicitly defined eps_mu_g_new with mu/sigma taken directly from the inference data summary; new groups were assigned mu=0 and sigma=3 (also covered in the first sketch below). Now the model was able to use that previously learned information, but I realized after the fact that this still does not mean the model sees the group-specific params in the original inference data.

  3. In this attempt, I used an entirely new model and did not use pm.Flat at all. Every variable was assigned values of mu and sigma taken from the inference data summary, and new groups got the same mu=0 and sigma=3 as above. In this case I was able to fully sample the new model, since pm.Flat was not used to define any of the variables. (I come back to this approach further down.)

  4. In the last attempt I followed the example from PyMC Labs: I used the old variable names with pm.Flat to tell the model to look up existing groups in the inference data, and defined a new variable for the new groups. The two were then concatenated and indexed by group number as usual (see the second sketch after this list).
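
To make (1) and (2) concrete, here is a condensed sketch of what I did, with hypothetical names: idata is the original fit, frame_new codes its groups 0..K-1, group_labels matches that coding, and eps_prior_mu/eps_prior_sd would be arrays built from az.summary of the original fit:

    with pm.Model(coords={"idx": group_labels}) as pred_model:
        g_new = pm.Data("g_new", frame_new["idx"].to_numpy())

        # Same names as in the original model, so
        # pm.sample_posterior_predictive fills these in from the posterior
        # draws in idata instead of sampling them
        mu = pm.Flat("mu")
        tau = pm.Flat("tau")
        sigma = pm.HalfFlat("sigma")

        # Attempt (1): a *renamed* offset. Nothing in idata matches this
        # name, so every group's offset is drawn fresh from N(0, 1) and the
        # posterior predictive comes out identical across groups.
        eps_mu_g_new = pm.Normal("eps_mu_g_new", 0.0, 1.0, dims="idx")

        # Attempt (2) replaced the line above with summary-based priors,
        # using (0, 3) for new groups:
        # eps_mu_g_new = pm.Normal("eps_mu_g_new", eps_prior_mu,
        #                          eps_prior_sd, dims="idx")

        mu_g_new = mu + tau * eps_mu_g_new
        y_pred = pm.Normal("y_pred", mu=mu_g_new[g_new], sigma=sigma)

        ppc = pm.sample_posterior_predictive(idata, var_names=["y_pred"])
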
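And attempt (4), following the PyMC Labs pattern: keep the original variable names behind pm.Flat so the existing groups' posterior draws are reused, and concatenate freshly sampled offsets for the new groups. Again a sketch: n_groups and n_new are hypothetical counts, and frame_all is assumed to code the new groups as n_groups..n_groups+n_new-1:

    import pytensor.tensor as pt

    with pm.Model() as pred_model_4:
        g_all = pm.Data("g_all", frame_all["idx"].to_numpy())

        mu = pm.Flat("mu")
        tau = pm.Flat("tau")
        sigma = pm.HalfFlat("sigma")

        # Original name => posterior draws for the existing groups are reused
        eps_mu_g = pm.Flat("eps_mu_g", shape=n_groups)
        # New groups draw their offsets from the population distribution
        eps_mu_g_new = pm.Normal("eps_mu_g_new", 0.0, 1.0, shape=n_new)

        mu_g = mu + tau * pt.concatenate([eps_mu_g, eps_mu_g_new])
        y_pred = pm.Normal("y_pred", mu=mu_g[g_all], sigma=sigma)

        ppc = pm.sample_posterior_predictive(idata, var_names=["y_pred"])

Note that in both sketches y_pred has no observed kwarg, which is exactly the issue I get to below: the actual values for the new groups never enter the model.
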

Obviously, (1) does not yield meaningful predictions for new groups, since no group-specific information carries over. Interestingly (or perhaps not to others), (2)-(4) gave me nearly identical predictions within each group.

One thing that concerns me is how models (2) and (4) can see any of the new data other than the group it belongs to. What I mean is that, since we only sample the posterior predictive, the only value being used is the one that indexes the mu_g param; the raw data for the new groups are completely unobserved. In example (3), we sample a completely new model, and all of the data are passed in as observed, with the mean of the observed variable y_obs defined by the group-indexed mu_g.

I can see how (2) and (4) are advantageous when I only have predicted group means for unseen groups that I can pass in as priors, as in the PyMC Labs example. However, when I have actual observed data that come in at a later time, is the only way for the model to see the values of those observations to train a new model and use a summary of the previous inference as priors (sketched below)? It feels like I am throwing away important information by reducing the results to a mean/std and then using those as priors.
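
Concretely, the refit I mean (the attempt-(3) style) would look something like the sketch below, where mu_mean, mu_sd, tau_sd, sigma_sd, eps_prior_mu, and eps_prior_sd are hypothetical values pulled from az.summary of the first fit, with (0, 3) entries appended for the new groups:

    with pm.Model(coords={"idx": all_group_labels}) as refit_model:
        g = pm.Data("g", frame_all["idx"].to_numpy())
        y = pm.Data("y", frame_all["data"].to_numpy())

        # Moment-matched priors: the joint posterior of the first fit is
        # reduced to per-variable means/sds, which is the information loss
        # I am worried about
        mu = pm.Normal("mu", mu_mean, mu_sd)
        tau = pm.HalfNormal("tau", tau_sd)
        sigma = pm.HalfNormal("sigma", sigma_sd)

        eps_mu_g = pm.Normal("eps_mu_g", eps_prior_mu, eps_prior_sd, dims="idx")
        mu_g = pm.Deterministic("mu_g", mu + tau * eps_mu_g, dims="idx")

        # Unlike (2)/(4), the new observations are actually conditioned on
        pm.Normal("y_obs", mu=mu_g[g], sigma=sigma, observed=y)

        idata_refit = pm.sample()
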

I think that in the case where I have few observations on new groups, the raw data will not matter much anyway, since group means with insufficient data will be shrunk toward the pooled mean. I am more concerned about cases where the new data are sufficiently different from the group mean that the group mean would have been pulled away from the pooled mean had it been part of the original sampling, while I still want the new group to abide by the constraints imposed by the hierarchical model.

Thanks!