How to properly do out-of-sample prediction for hierarchical model

peeeffchang · August 22, 2023, 9:07am

Hello everyone. I’ve been checking various related posts but so far still unable to perform sample_posterior_predictive.

I have defined a model constructor function

def model_ordered_logistic_partial_pooled(coords, cohort=cohort_):       
    with pm.Model(coords=coords) as model:
        x_0 = pm.Data("x_0", training_data['score_chg'].astype(float).values, mutable=True)
        x_1 = pm.Data("x_1", training_data['odds_zs'].values, mutable=True)
        y = pm.Data("y", training_data['cohort_4gp'].values - 1, mutable=True)
    
        race_idx = pm.Data("race_idx", race_id_training, dims="obs_id", mutable=True)
        
        # Top level prior
        a_mean = pm.Normal('a_mean', 0., 10.)
        a_stdev = pm.HalfNormal('a_stddev', 10.)
        
        # Individual race
        a = pm.Normal(
            'a', a_mean, a_stdev,
            transform=pm.distributions.transforms.univariate_ordered,
            shape=(len(coords['race']), cohort - 1),
            dims="race",
        )
        a_ = pm.Deterministic('a_', a, dims="race")
        a_t = pt.as_tensor_variable(a_)

        b_mean = pm.Normal('b_mean', [-0.05, 1., 0., 0.], 4*[5.])
        b_stddev = pm.HalfNormal('b_stddev', 4*[10.])
        b0 = pm.Normal('b0', b_mean[0], b_stddev[0], dims="race", shape=(len(coords['race'])))
        b1 = pm.Normal('b1', b_mean[1], b_stddev[1], dims="race", shape=(len(coords['race'])))

        phi = pm.Deterministic("phi", b0[race_idx] * x_0.get_value()
                                   + b1[race_idx] * x_1.get_value())

        resp_obs = pm.OrderedLogistic(
            'cohort_obs', phi, a_t[race_idx],
            observed=y.get_value(),
            dims="obs_id"
        )

    return model

Sampling is successful

olmpp_4 = model_ordered_logistic_partial_pooled(coords_training, 4)

with olmpp_4:
    trace_olmpp_4 = pm.sample(10, tune=10, chains=1, random_seed=42, return_inferencedata=True, idata_kwargs={'log_likelihood':True})

However, I then attempted to do out-of-sample testing. I created new model with new index coord dictionary to act on the trace I got from the training model

race_id_testing, races_testing = testing_data.race_idx.factorize(sort=True)

coords_testing = {
    "race": races_testing,
    "race_id": race_id_testing,
    "obs_id": np.arange(len(testing_data))
}

olmpp_4_oos = model_ordered_logistic_partial_pooled(coords_testing, 4)

with olmpp_4_oos:
    pm.set_data({'x_0': testing_data['score_chg'].astype(float).values,
                'x_1': testing_data['odds_zs'].values,
                'y': testing_data['cohort_4gp'].values - 1,
                "race_idx": race_id_testing})
    
    pp_olmpp_4 = pm.sample_posterior_predictive(trace_olmpp_4)
    trace_olmpp_4.extend(pp_olmpp_4)

But unfortunately got an error

TypeError: ("The type's shape ((1498,)) is not compatible with the data's ((5949,))", 'Container name "b0"')

where 1498 is the length of testing set’s races_testing and 5949 training set’s races_training. What did I miss? I also tried working with pytensor shared as suggested in here but got similar results…thanks in advance for any suggestion.

ricardoV94 · August 22, 2023, 9:45am

We discuss this in this blogpost: Out of model predictions with PyMC - PyMC Labs

Check the section # Simulating new groups in hierarchical models, but you might want to read the whole thing to understand exactly what sample_posterior_predictive can and cannot do and how it works.

peeeffchang · August 22, 2023, 2:23pm

Thanks. I read the page, but it seems like what I’ll have to do is quite different due to the use of coord (in my case one row of data is one data point, instead of one group).

Edit: OK I think I get the gists now. Basically to perform out-of-sampling prediction, I have to create a new model that covers both training and test sets, and make some additional assumptions regarding the unseen groups.

ricardoV94 · August 23, 2023, 1:49pm

If you don’t care about the original groups, you can have a simpler model that just takes draws for the new groups/individuals. But yes, that’s the simpler approach either way.

peeeffchang · August 24, 2023, 10:05am

Slowly losing hope; the additional model for prediction is still showing

TypeError: ("The type's shape ((7447,)) is not compatible with the data's ((5949,))", 'Container name "b0"')

where 7447 is the overall number of groups in the full data set; 5949 is the number of groups in the training model.

I thought this thread
https://discourse.pymc.io/t/out-of-sample-prediction/11794

is relevant to my issue but not sure how to apply the solution.

ricardoV94 · August 24, 2023, 10:52am

I suggest you try to start with the simplest case and expand gradually. That should help you pinpoint the problem faster.

Here is a simplified example:

import pymc as pm

idx = [0, 0, 0, 1, 1, 1]
coords = {
    "id": [0, 1],
    "obs_idx": list(range(len(idx))),
}
with pm.Model(coords=coords) as m:
    x = pm.Data("x", [0, 0, 0, 0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    b = pm.Normal("b", mu_group, dims=("id",))
    mu = b[idx] * x
    obs = pm.Normal("obs", mu, observed=[0, 1, 2, 3, 4, 5], dims=("obs_idx",))

    idata = pm.sample()

# Predict two new groups
idx = [0, 0, 1]
new_coords = {
    "id": [2, 3],
    "obs_idx": list(range(len(idx))),
}
with pm.Model(coords=new_coords) as pred_m:
    x = pm.Data("x", [0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    # Needs a new name, so old b is not used!
    new_b = pm.Normal("new_b", mu_group, dims=("id",))
    mu = new_b[idx] * x
    obs = pm.Normal("obs", mu, dims=("obs_idx",))

    idata = pm.sample_posterior_predictive(idata, var_names=["obs"], predictions=True, extend_inferencedata=True)

peeeffchang · August 25, 2023, 8:51am

ricardoV94:

import pymc as pm

idx = [0, 0, 0, 1, 1, 1]
coords = {
    "id": [0, 1],
    "obs_idx": list(range(len(idx))),
}
with pm.Model(coords=coords) as m:
    x = pm.Data("x", [0, 0, 0, 0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    b = pm.Normal("b", mu_group, dims=("id",))
    mu = b[idx] * x
    obs = pm.Normal("obs", mu, observed=[0, 1, 2, 3, 4, 5], dims=("obs_idx",))

    idata = pm.sample()

# Predict two new groups
idx = [0, 0, 1]
new_coords = {
    "id": [2, 3],
    "obs_idx": list(range(len(new_idx))),
}
with pm.Model(coords=new_coords) as pred_m:
    x = pm.Data("x", [0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    # Needs a new name, so old b is not used!
    new_b = pm.Normal("new_b", mu_group, dims=("id",))
    mu = new_b[idx] * x
    obs = pm.Normal("obs", mu, dims=("obs_idx",))

    idata = pm.sample_posterior_predictive(idata, var_names=["obs"], predictions=True, extend_inferencedata=True)

Thanks @ricardoV94, I tried running your sample code but got the error

AssertionError: SpecifyShape: dim 0 of input has shape 6, expected 3.

on the sample_posterior_predictive line. Could this be the PyMC 5 vs 3 behaviour discrepancy? I recall seeing that somewhere.

ricardoV94 · August 25, 2023, 9:19am

There was a small typo in my example (edited). Here it is running on Google Colab: Google Colab

If you are using an old version of PyMC do update it if you can.

peeeffchang · August 29, 2023, 9:13am

Thanks. I noticed that the new idx has to be [0, 0, 1], instead of say [9, 9, 10]. Is there a particular reason for this requirement?

ricardoV94 · August 29, 2023, 12:31pm

You would need the array to have 9 dummy entries if you wand idx values of 9 and 10 to work

peeeffchang · September 12, 2023, 9:50am

Update: I think I’ve seen someone else making the same observation, that it turns out sample_posterior_predictive works only if the number of out-of-sample forecast is the same as training set.

ricardoV94 · September 12, 2023, 10:08am

That’s not the case, but if you are using dims, you may need to set predictions=True so InferenceData doesn’t complain.

Do you have an example?

peeeffchang · September 13, 2023, 8:53am

num_cohort = 2

data_list = [-1.2083030673141326,
 0.5652466185040567,
 -0.7851719239971027,
 0.20513500717041422,
 0.5652466185040567,
 -0.1549766041632283,
 2.0056930638386268,
 -1.2983309701475432,
 0.745302424170878,
 1.1054140355045206,
 -0.42506031266346017,
 -0.6951440211636921,
 -1.280325389580861,
 0.6552745213374673,
 -0.7873164649263759,
 -0.6484249514565011,
 0.5817570249909609,
 2.089722028378173,
 -0.410325214079573,
 -0.8746197019645828,
 0.8198567623678892,
 -0.410325214079573,
 -1.0214478733470218,
 0.6611236041166038,
 -0.9220549344577036,
 -0.8455301825797763,
 0.24973032867305697,
 0.9193219076049202,
 -0.6111731299536243,
 0.05841844897823892]

index_list = [0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 3,
 3,
 3,
 3,
 3]

output_list = [4,
 4,
 4,
 4,
 4,
 4,
 4,
 3,
 4,
 3,
 2,
 4,
 2,
 1,
 2,
 1,
 4,
 4,
 4,
 2,
 4,
 3,
 3,
 4,
 1,
 2,
 2,
 4,
 4,
 4]

prediction_len = 30
training_data_lite = pd.DataFrame({'in1':data_list, 'in2':data_list, 'out':output_list, 'race_idx':index_list})
testing_data_lite = pd.DataFrame({'in1':data_list[:prediction_len], 'in2':data_list[:prediction_len],'race_idx':index_list[:prediction_len]})
# Re-assign consecutive index starting from zero
training_race_idx_mapping = dict(zip(set(training_data_lite.race_idx), range(len(set(training_data_lite.race_idx)))))
training_data_lite = training_data_lite.replace({"race_idx": training_race_idx_mapping})
testing_race_idx_mapping = dict(zip(set(testing_data_lite.race_idx), range(len(set(testing_data_lite.race_idx)))))
testing_data_lite = testing_data_lite.replace({"race_idx": testing_race_idx_mapping})

temp_df = training_data_lite

idx = temp_df.race_idx
_, races_full = temp_df.race_idx.factorize(sort=True)
coords = {
    "id": races_full,
    "obs_idx": list(range(len(idx))),
}
with pm.Model(coords=coords) as olmpp_4:
    x_0 = pm.Data("x_0", temp_df['in1'].astype(float).values, mutable=True, dims=("obs_idx",))
    x_1 = pm.Data("x_1", temp_df['in2'].values, mutable=True, dims=("obs_idx",))
    y = pm.Data("y", temp_df['out'].values - 1, mutable=True, dims=("obs_idx",))

    b_mean = pm.Normal('b_mean', [-0.05, 1., 0., 0.], 4*[5.])
    b_stddev = pm.HalfNormal('b_stddev', 4*[10.])
    b0 = pm.Normal('b0', b_mean[0], b_stddev[0], dims=("id",))
    b1 = pm.Normal('b1', b_mean[1], b_stddev[1], dims=("id",))

    phi = b0[idx] * x_0\
        + b1[idx] * x_1\

    a_mean = pm.Normal('a_mean', 0., 10.)
    a_stdev = pm.HalfNormal('a_stddev', 10.)
    
    a = pm.Normal(
         'a', a_mean, a_stdev,
         transform=pm.distributions.transforms.univariate_ordered,
         shape=(len(coords['id']), 4 - 1),
         testval=len(coords['id']) * [[-2.512425133191401, -1.1671272329327658, -0.39780512413988334]],
         dims=("id",),
     )

    obs = pm.OrderedLogistic("obs", phi, a[idx], observed=y, dims=("obs_idx",))

    idata = pm.sample(10, tune=10, chains=4, return_inferencedata=True, idata_kwargs={'log_likelihood':True})

temp_df2 = testing_data_lite
idx2 = temp_df2.race_idx
_, races_full2 = temp_df2.race_idx.factorize(sort=True)
new_coords = {
    "id": races_full2,
    "obs_idx": list(range(len(idx2))),
}
with pm.Model(coords=new_coords) as olmpp_4_pre:
    x_0 = pm.Data("x_0", temp_df2['in1'].astype(float).values, mutable=True, dims=("obs_idx",))
    x_1 = pm.Data("x_1", temp_df2['in2'].values, mutable=True, dims=("obs_idx",))
    
    b_mean = pm.Normal('b_mean', [-0.05, 1., 0., 0.], 4*[5.])
    b_stddev = pm.HalfNormal('b_stddev', 4*[10.])
    new_b0 = pm.Normal('new_b0', b_mean[0], b_stddev[0], dims=("id",))
    new_b1 = pm.Normal('new_b1', b_mean[1], b_stddev[1], dims=("id",))

    phi = new_b0[idx] * x_0\
        + new_b1[idx] * x_1\

    a_mean = pm.Normal('a_mean', 0., 10.)
    a_stdev = pm.HalfNormal('a_stddev', 10.)
    
    new_a = pm.Normal(
         'new_a', a_mean, a_stdev,
         transform=pm.distributions.transforms.univariate_ordered,
         shape=(len(coords['id']), 4 - 1),
         dims=("id",),
     )
    
    obs = pm.OrderedLogistic("obs", phi, new_a[idx], dims=("obs_idx",))

    idata_pre = pm.sample_posterior_predictive(idata, var_names=["obs"], predictions=True, extend_inferencedata=False)

If prediction_len < 30, I got an error

ricardoV94 · September 13, 2023, 11:15am

temp_df2 = testing_data_lite
idx2 = temp_df2.race_idx

Doesn’t look like you are using idx2 in the prediction model but still idx. Certainly you are doing some indexing error that has nothing to do with sample_posterior_predictive

peeeffchang · September 13, 2023, 11:19am

Thanks @ricardoV94. Not sure I understand your point, I do have separate index for testing versus training

idx = temp_df.race_idx

and

idx2 = temp_df2.race_idx

Isn’t this the right way to do it?

ricardoV94 · September 13, 2023, 11:21am

Ignore that you have two models, and focus just on the predictive model. If you are creating some variables and indexing, you have to make sure those are consistent with each other. If you create a variable with ten entries and have an index with a value of 20 you are going to get an index error.

You can call pm.sample_prior_predictive on the predictive model to confirm you have an error in how you defined the model.

ricardoV94 · September 13, 2023, 11:24am

Also for debugging I suggest you trim down your model to the bare minimum that still reproduces the error, one intercept per group member is probably enough. This will make it much easier to find the source of the problem or for someone to find it for you.

ricardoV94 · September 13, 2023, 11:26am

And don’t be shy about posting the error messages you get, those are quite useful

jaharvey8 · January 24, 2024, 2:25pm

Oh wait, I see my error, ignore me.

jaharvey8 · January 24, 2024, 2:31pm

Actually let me post this because I’m not sure I’m doing this correctly. Would this be the correct way to modify your google colab @ricardoV94 to make predictions on one of the previous groups and two new groups? It’s unclear to me if I’m getting predictions for all 4 groups doing it this way or just the three desired groups.

idx = [0, 0, 0, 1, 1, 1]
coords = {
    "id": [0, 1],
    "obs_idx": list(range(len(idx))),
}
with pm.Model(coords=coords) as m:
    x = pm.Data("x", [0, 0, 0, 0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    b = pm.Normal("b", mu_group, dims=("id",))
    mu = b[idx] * x
    obs = pm.Normal("obs", mu, observed=[0, 1, 2, 3, 4, 5], dims=("obs_idx",))

    idata = pm.sample()

# Predict two new groups
new_idx = [0, 0, 1, 2]
new_coords = {
    "id":[0,1],
    "new_id": [2, 3],
    "obs_idx": list(range(len(new_idx))),
}
with pm.Model(coords=new_coords) as pred_m:
    x = pm.Data("x", [0, 0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    # Needs a new name, so old b is not used!
    new_b = pm.Normal("new_b", mu_group, dims=("new_id",))
    b = pm.Normal("b", mu_group, dims=("id",))
    mu = pm.math.concatenate([new_b[new_idx], b[idx]]) * x
    obs = pm.Normal("obs", mu, dims=("obs_idx",))

    idata = pm.sample_posterior_predictive(idata, var_names=["obs"], predictions=True, extend_inferencedata=True)

Edit: No, that can’t be right because “id” is still [0,1] but if I change “id” to [1] for example, i.e.,

# Predict two new groups
new_idx = [0, 0, 1, 2]
new_coords = {
    "id":[1],
    "new_id": [2, 3],
    "obs_idx": list(range(len(new_idx))),
}
with pm.Model(coords=new_coords) as pred_m:
    x = pm.Data("x", [0, 0, 0, 0], dims=("obs_idx",))
    mu_group = pm.Normal("mu_group")
    # Needs a new name, so old b is not used!
    new_b = pm.Normal("new_b", mu_group, dims=("new_id",))
    b = pm.Normal("b", mu_group, dims=("id",))
    mu = pm.math.concatenate([new_b[new_idx], b[idx]]) * x
    obs = pm.Normal("obs", mu, dims=("obs_idx",))

    idata = pm.sample_posterior_predictive(idata, var_names=["obs"], predictions=True, extend_inferencedata=True)

I get

TypeError: ("The type's shape ((1,)) is not compatible with the data's ((2,))", 'Container name "b"')

Topic		Replies	Views
Out of sample/model predictions in hierarchical model with new observed data v5 modeling	9	299	October 16, 2024
Out of Sample Predictions on heirarchical regression model (PyMC - v5.0.1) v5	5	588	March 2, 2023
How do we predict on new unseen groups in a hierarchical model in PyMC3? Questions	10	5393	September 12, 2022
Out of sample prediction with new category version agnostic	5	923	August 3, 2022
Pymc5 out of sample v5 modeling	9	1235	August 20, 2023

How to properly do out-of-sample prediction for hierarchical model

Related topics