How to properly do out-of-sample prediction for hierarchical model

Hello everyone. I’ve been checking various related posts but so far still unable to perform sample_posterior_predictive.

I have defined a model constructor function

def model_ordered_logistic_partial_pooled(coords, cohort=cohort_):       
    with pm.Model(coords=coords) as model:
        x_0 = pm.Data("x_0", training_data['score_chg'].astype(float).values, mutable=True)
        x_1 = pm.Data("x_1", training_data['odds_zs'].values, mutable=True)
        y = pm.Data("y", training_data['cohort_4gp'].values - 1, mutable=True)
    
        race_idx = pm.Data("race_idx", race_id_training, dims="obs_id", mutable=True)
        
        # Top level prior
        a_mean = pm.Normal('a_mean', 0., 10.)
        a_stdev = pm.HalfNormal('a_stddev', 10.)
        
        # Individual race
        a = pm.Normal(
            'a', a_mean, a_stdev,
            transform=pm.distributions.transforms.univariate_ordered,
            shape=(len(coords['race']), cohort - 1),
            dims="race",
        )
        a_ = pm.Deterministic('a_', a, dims="race")
        a_t = pt.as_tensor_variable(a_)

        b_mean = pm.Normal('b_mean', [-0.05, 1., 0., 0.], 4*[5.])
        b_stddev = pm.HalfNormal('b_stddev', 4*[10.])
        b0 = pm.Normal('b0', b_mean[0], b_stddev[0], dims="race", shape=(len(coords['race'])))
        b1 = pm.Normal('b1', b_mean[1], b_stddev[1], dims="race", shape=(len(coords['race'])))

        phi = pm.Deterministic("phi", b0[race_idx] * x_0.get_value()
                                   + b1[race_idx] * x_1.get_value())

        resp_obs = pm.OrderedLogistic(
            'cohort_obs', phi, a_t[race_idx],
            observed=y.get_value(),
            dims="obs_id"
        )

    return model

Sampling is successful

olmpp_4 = model_ordered_logistic_partial_pooled(coords_training, 4)

with olmpp_4:
    trace_olmpp_4 = pm.sample(10, tune=10, chains=1, random_seed=42, return_inferencedata=True, idata_kwargs={'log_likelihood':True})

However, I then attempted to do out-of-sample testing. I created new model with new index coord dictionary to act on the trace I got from the training model

race_id_testing, races_testing = testing_data.race_idx.factorize(sort=True)

coords_testing = {
    "race": races_testing,
    "race_id": race_id_testing,
    "obs_id": np.arange(len(testing_data))
}

olmpp_4_oos = model_ordered_logistic_partial_pooled(coords_testing, 4)

with olmpp_4_oos:
    pm.set_data({'x_0': testing_data['score_chg'].astype(float).values,
                'x_1': testing_data['odds_zs'].values,
                'y': testing_data['cohort_4gp'].values - 1,
                "race_idx": race_id_testing})
    
    pp_olmpp_4 = pm.sample_posterior_predictive(trace_olmpp_4)
    trace_olmpp_4.extend(pp_olmpp_4)

But unfortunately got an error

TypeError: ("The type's shape ((1498,)) is not compatible with the data's ((5949,))", 'Container name "b0"')

where 1498 is the length of testing set’s races_testing and 5949 training set’s races_training. What did I miss? I also tried working with pytensor shared as suggested in here but got similar results…thanks in advance for any suggestion.

We discuss this in this blogpost: Out of model predictions with PyMC - PyMC Labs

Check the section # Simulating new groups in hierarchical models, but you might want to read the whole thing to understand exactly what sample_posterior_predictive can and cannot do and how it works.

Thanks. I read the page, but it seems like what I’ll have to do is quite different due to the use of coord (in my case one row of data is one data point, instead of one group).

Edit: OK I think I get the gists now. Basically to perform out-of-sampling prediction, I have to create a new model that covers both training and test sets, and make some additional assumptions regarding the unseen groups.

If you don’t care about the original groups, you can have a simpler model that just takes draws for the new groups/individuals. But yes, that’s the simpler approach either way.

Slowly losing hope; the additional model for prediction is still showing

TypeError: ("The type's shape ((7447,)) is not compatible with the data's ((5949,))", 'Container name "b0"')

where 7447 is the overall number of groups in the full data set; 5949 is the number of groups in the training model.

I thought this thread
https://discourse.pymc.io/t/out-of-sample-prediction/11794

is relevant to my issue but not sure how to apply the solution.

I suggest you try to start with the simplest case and expand gradually. That should help you pinpoint the problem faster.

Here is a simplified example:

import pymc as pm

idx = [0, 0, 0, 1, 1, 1]
coords = {
    "id": [0, 1],
    "obs_idx": list(range(len(idx))),
}
with pm.Model(coords=coords) as m:
    x = pm.Data("x", [0, 0, 0, 0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    b = pm.Normal("b", mu_group, dims=("id",))
    mu = b[idx] * x
    obs = pm.Normal("obs", mu, observed=[0, 1, 2, 3, 4, 5], dims=("obs_idx",))

    idata = pm.sample()

# Predict two new groups
idx = [0, 0, 1]
new_coords = {
    "id": [2, 3],
    "obs_idx": list(range(len(idx))),
}
with pm.Model(coords=new_coords) as pred_m:
    x = pm.Data("x", [0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    # Needs a new name, so old b is not used!
    new_b = pm.Normal("new_b", mu_group, dims=("id",))
    mu = new_b[idx] * x
    obs = pm.Normal("obs", mu, dims=("obs_idx",))

    idata = pm.sample_posterior_predictive(idata, var_names=["obs"], predictions=True, extend_inferencedata=True)

Thanks @ricardoV94, I tried running your sample code but got the error

AssertionError: SpecifyShape: dim 0 of input has shape 6, expected 3.

on the sample_posterior_predictive line. Could this be the PyMC 5 vs 3 behaviour discrepancy? I recall seeing that somewhere.

There was a small typo in my example (edited). Here it is running on Google Colab: Google Colab

If you are using an old version of PyMC do update it if you can.

Thanks. I noticed that the new idx has to be [0, 0, 1], instead of say [9, 9, 10]. Is there a particular reason for this requirement?

You would need the array to have 9 dummy entries if you wand idx values of 9 and 10 to work

Update: I think I’ve seen someone else making the same observation, that it turns out sample_posterior_predictive works only if the number of out-of-sample forecast is the same as training set.

That’s not the case, but if you are using dims, you may need to set predictions=True so InferenceData doesn’t complain.

Do you have an example?

num_cohort = 2

data_list = [-1.2083030673141326,
 0.5652466185040567,
 -0.7851719239971027,
 0.20513500717041422,
 0.5652466185040567,
 -0.1549766041632283,
 2.0056930638386268,
 -1.2983309701475432,
 0.745302424170878,
 1.1054140355045206,
 -0.42506031266346017,
 -0.6951440211636921,
 -1.280325389580861,
 0.6552745213374673,
 -0.7873164649263759,
 -0.6484249514565011,
 0.5817570249909609,
 2.089722028378173,
 -0.410325214079573,
 -0.8746197019645828,
 0.8198567623678892,
 -0.410325214079573,
 -1.0214478733470218,
 0.6611236041166038,
 -0.9220549344577036,
 -0.8455301825797763,
 0.24973032867305697,
 0.9193219076049202,
 -0.6111731299536243,
 0.05841844897823892]

index_list = [0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 3,
 3,
 3,
 3,
 3]

output_list = [4,
 4,
 4,
 4,
 4,
 4,
 4,
 3,
 4,
 3,
 2,
 4,
 2,
 1,
 2,
 1,
 4,
 4,
 4,
 2,
 4,
 3,
 3,
 4,
 1,
 2,
 2,
 4,
 4,
 4]

prediction_len = 30
training_data_lite = pd.DataFrame({'in1':data_list, 'in2':data_list, 'out':output_list, 'race_idx':index_list})
testing_data_lite = pd.DataFrame({'in1':data_list[:prediction_len], 'in2':data_list[:prediction_len],'race_idx':index_list[:prediction_len]})
# Re-assign consecutive index starting from zero
training_race_idx_mapping = dict(zip(set(training_data_lite.race_idx), range(len(set(training_data_lite.race_idx)))))
training_data_lite = training_data_lite.replace({"race_idx": training_race_idx_mapping})
testing_race_idx_mapping = dict(zip(set(testing_data_lite.race_idx), range(len(set(testing_data_lite.race_idx)))))
testing_data_lite = testing_data_lite.replace({"race_idx": testing_race_idx_mapping})

temp_df = training_data_lite

idx = temp_df.race_idx
_, races_full = temp_df.race_idx.factorize(sort=True)
coords = {
    "id": races_full,
    "obs_idx": list(range(len(idx))),
}
with pm.Model(coords=coords) as olmpp_4:
    x_0 = pm.Data("x_0", temp_df['in1'].astype(float).values, mutable=True, dims=("obs_idx",))
    x_1 = pm.Data("x_1", temp_df['in2'].values, mutable=True, dims=("obs_idx",))
    y = pm.Data("y", temp_df['out'].values - 1, mutable=True, dims=("obs_idx",))

    b_mean = pm.Normal('b_mean', [-0.05, 1., 0., 0.], 4*[5.])
    b_stddev = pm.HalfNormal('b_stddev', 4*[10.])
    b0 = pm.Normal('b0', b_mean[0], b_stddev[0], dims=("id",))
    b1 = pm.Normal('b1', b_mean[1], b_stddev[1], dims=("id",))

    phi = b0[idx] * x_0\
        + b1[idx] * x_1\

    a_mean = pm.Normal('a_mean', 0., 10.)
    a_stdev = pm.HalfNormal('a_stddev', 10.)
    
    a = pm.Normal(
         'a', a_mean, a_stdev,
         transform=pm.distributions.transforms.univariate_ordered,
         shape=(len(coords['id']), 4 - 1),
         testval=len(coords['id']) * [[-2.512425133191401, -1.1671272329327658, -0.39780512413988334]],
         dims=("id",),
     )

    obs = pm.OrderedLogistic("obs", phi, a[idx], observed=y, dims=("obs_idx",))

    idata = pm.sample(10, tune=10, chains=4, return_inferencedata=True, idata_kwargs={'log_likelihood':True})

temp_df2 = testing_data_lite
idx2 = temp_df2.race_idx
_, races_full2 = temp_df2.race_idx.factorize(sort=True)
new_coords = {
    "id": races_full2,
    "obs_idx": list(range(len(idx2))),
}
with pm.Model(coords=new_coords) as olmpp_4_pre:
    x_0 = pm.Data("x_0", temp_df2['in1'].astype(float).values, mutable=True, dims=("obs_idx",))
    x_1 = pm.Data("x_1", temp_df2['in2'].values, mutable=True, dims=("obs_idx",))
    
    b_mean = pm.Normal('b_mean', [-0.05, 1., 0., 0.], 4*[5.])
    b_stddev = pm.HalfNormal('b_stddev', 4*[10.])
    new_b0 = pm.Normal('new_b0', b_mean[0], b_stddev[0], dims=("id",))
    new_b1 = pm.Normal('new_b1', b_mean[1], b_stddev[1], dims=("id",))

    phi = new_b0[idx] * x_0\
        + new_b1[idx] * x_1\

    a_mean = pm.Normal('a_mean', 0., 10.)
    a_stdev = pm.HalfNormal('a_stddev', 10.)
    
    new_a = pm.Normal(
         'new_a', a_mean, a_stdev,
         transform=pm.distributions.transforms.univariate_ordered,
         shape=(len(coords['id']), 4 - 1),
         dims=("id",),
     )
    
    obs = pm.OrderedLogistic("obs", phi, new_a[idx], dims=("obs_idx",))

    idata_pre = pm.sample_posterior_predictive(idata, var_names=["obs"], predictions=True, extend_inferencedata=False)

If prediction_len < 30, I got an error

temp_df2 = testing_data_lite
idx2 = temp_df2.race_idx

Doesn’t look like you are using idx2 in the prediction model but still idx. Certainly you are doing some indexing error that has nothing to do with sample_posterior_predictive

Thanks @ricardoV94. Not sure I understand your point, I do have separate index for testing versus training

idx = temp_df.race_idx

and

idx2 = temp_df2.race_idx

Isn’t this the right way to do it?

Ignore that you have two models, and focus just on the predictive model. If you are creating some variables and indexing, you have to make sure those are consistent with each other. If you create a variable with ten entries and have an index with a value of 20 you are going to get an index error.

You can call pm.sample_prior_predictive on the predictive model to confirm you have an error in how you defined the model.

Also for debugging I suggest you trim down your model to the bare minimum that still reproduces the error, one intercept per group member is probably enough. This will make it much easier to find the source of the problem or for someone to find it for you.

And don’t be shy about posting the error messages you get, those are quite useful

Oh wait, I see my error, ignore me.

Actually let me post this because I’m not sure I’m doing this correctly. Would this be the correct way to modify your google colab @ricardoV94 to make predictions on one of the previous groups and two new groups? It’s unclear to me if I’m getting predictions for all 4 groups doing it this way or just the three desired groups.

idx = [0, 0, 0, 1, 1, 1]
coords = {
    "id": [0, 1],
    "obs_idx": list(range(len(idx))),
}
with pm.Model(coords=coords) as m:
    x = pm.Data("x", [0, 0, 0, 0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    b = pm.Normal("b", mu_group, dims=("id",))
    mu = b[idx] * x
    obs = pm.Normal("obs", mu, observed=[0, 1, 2, 3, 4, 5], dims=("obs_idx",))

    idata = pm.sample()

# Predict two new groups
new_idx = [0, 0, 1, 2]
new_coords = {
    "id":[0,1],
    "new_id": [2, 3],
    "obs_idx": list(range(len(new_idx))),
}
with pm.Model(coords=new_coords) as pred_m:
    x = pm.Data("x", [0, 0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    # Needs a new name, so old b is not used!
    new_b = pm.Normal("new_b", mu_group, dims=("new_id",))
    b = pm.Normal("b", mu_group, dims=("id",))
    mu = pm.math.concatenate([new_b[new_idx], b[idx]]) * x
    obs = pm.Normal("obs", mu, dims=("obs_idx",))

    idata = pm.sample_posterior_predictive(idata, var_names=["obs"], predictions=True, extend_inferencedata=True)

Edit: No, that can’t be right because “id” is still [0,1] but if I change “id” to [1] for example, i.e.,

# Predict two new groups
new_idx = [0, 0, 1, 2]
new_coords = {
    "id":[1],
    "new_id": [2, 3],
    "obs_idx": list(range(len(new_idx))),
}
with pm.Model(coords=new_coords) as pred_m:
    x = pm.Data("x", [0, 0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    # Needs a new name, so old b is not used!
    new_b = pm.Normal("new_b", mu_group, dims=("new_id",))
    b = pm.Normal("b", mu_group, dims=("id",))
    mu = pm.math.concatenate([new_b[new_idx], b[idx]]) * x
    obs = pm.Normal("obs", mu, dims=("obs_idx",))

    idata = pm.sample_posterior_predictive(idata, var_names=["obs"], predictions=True, extend_inferencedata=True)

I get

TypeError: ("The type's shape ((1,)) is not compatible with the data's ((2,))", 'Container name "b"')