Out of Sample Predictions on hierarchical regression model (PyMC - v5.0.1)

Hey everyone!
I have been trying to get out-of-sample predictions from my inference object, but I keep getting the error below whenever I use pm.sample_posterior_predictive with my model and trace.

size does not match the broadcast shape of the parameters. (1063,), (1063,), (4251,)

where 1063 is the number of rows in the test set for which I need samples, and 4251 is the number of rows in the training set. I used pm.MutableData("X", X_train) and pm.MutableData("Y", Y_train) while building the model, and pm.set_data({"X": X_test, "Y": Y_test}) to get predictions, as shown in this example.

I never got this error in PyMC3 using Theano shared variables; however, since moving to PyMC v5.0.1 I keep running into it.

Any kind of help would be much appreciated!

Do you use an index variable in the model, like mu = alpha[idx] + beta * feature_data? If so, you’ll have to make idx a mutable data object as well and pass in new indices for your test data, roughly like the sketch below.
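Here is a minimal sketch of that pattern (the data, sizes, and names like idx_train and feature_data are all placeholders, not your actual variables):

import numpy as np
import pymc as pm

# placeholder data: 3 groups, 40 train rows, 10 test rows
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=40), rng.normal(size=10)
idx_train = rng.integers(0, 3, size=40)
idx_test = rng.integers(0, 3, size=10)
y_train = rng.normal(size=40)

with pm.Model() as model:
    # the group index must be mutable too, otherwise it stays train-sized
    idx = pm.MutableData("idx", idx_train)
    feature_data = pm.MutableData("feature_data", X_train)
    y = pm.MutableData("y", y_train)

    alpha = pm.Normal("alpha", 0, 1, shape=3)
    beta = pm.Normal("beta", 0, 1)
    mu = alpha[idx] + beta * feature_data
    pm.Normal("obs", mu, 1, observed=y)

    trace = pm.sample(200, tune=200)

with model:
    # swap all three to test-sized arrays together so the shapes agree
    pm.set_data({"idx": idx_test, "feature_data": X_test, "y": np.zeros(10)})
    ppc = pm.sample_posterior_predictive(trace)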

I updated the code to make the index variable a mutable data object as well, and changed it before predicting, but the error is more or less the same.

with model:

    X = pm.MutableData("X", X_train_aug)
    Y = pm.MutableData("Y", Y_train)

    group_idx = pm.MutableData("group_idx", g_id)

    mu_alpha = pm.Normal('mu_alpha', mu=0, sigma=50)
    sigma_alpha = pm.HalfNormal('sigma_alpha', 15)
    mu_beta = pm.Normal('mu_beta', mu=1, sigma=10)
    sigma_beta = pm.HalfNormal('sigma_beta', 30)
    error = pm.Normal('e', mu=-10, sigma=20)

    beta1 = pm.Normal('beta1', mu_beta, sigma_beta, dims='group')
    beta2 = pm.Normal('beta2', mu_beta, sigma_beta, dims='group')
    alpha = pm.Normal('alpha', mu_alpha, sigma_alpha, dims='group')

    y = alpha[group_idx] + X_train_aug[col1].values*beta1[group_idx] + X_train_aug[col2].values*beta2[group_idx] + error
    target = pm.TruncatedNormal('target', mu=y, lower=0, upper=300, observed=Y)

    step = pm.NUTS(target_accept=0.95)
    trace = pm.sample(3000, init='ADVI+adapt_diag', return_inferencedata=True, tune=7000, step=step)

where g_id holds the training group indices, with length 4251 (one per training sample).

For predictions I am using the code below:

with model:
    pm.set_data({"X": X_test_aug, "Y": Y_test, "group_idx": t_id})
    pm.sample_posterior_predictive(trace, var_names = ['target'], return_inferencedata=True, predictions=True,  extend_inferencedata=True)

where t_id holds the test group indices, with length 1063 (one per test sample).

The error I am getting is:

Input dimension mismatch. One other input has shape[0] = 1063, but input[1].shape[0] = 4251

You are using X_train_aug, the underlying data, not X, the mutable object, in the y= line.

An easy check for problems like this is to always visualize your model with pm.model_to_graphviz. If you did, you would see that X is disconnected from the rest of the graph, which would prompt you to look for why it isn’t informing y, which leads you to the bug.
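For example, assuming your model object is in scope as model:

import pymc as pm

# render the model graph; a disconnected X node stands out immediately
graph = pm.model_to_graphviz(model)
graph.render("model_graph", format="png")  # or just display `graph` in a notebook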

I believe this will solve the problem. Thanks!

I have a few more syntax-related questions:

I used X_train_aug in the y = ... line because I need to use very specific columns for model building (see below):

y = alpha[group_idx] + X_train_aug[col1].values*beta1[group_idx] + X_train_aug[col2].values*beta2[group_idx] + error

In the above line of code, I used X_train_aug[col1].values and X_train_aug[col2].values to specify my formula (in reality there are 40+ columns that I write out manually). How can I do this with my mutable data object X? I can’t seem to slice the mutable data X by column names.

The example above uses two parameters (beta1 & beta2) to specify the formula:

    beta1 = pm.Normal('beta1', mu_beta, sigma_beta, dims='group')
    beta2 = pm.Normal('beta2', mu_beta, sigma_beta, dims='group')
    alpha = pm.Normal('alpha', mu_alpha, sigma_alpha, dims='group')

    y = alpha[group_idx] + X_train_aug[col1].values*beta1[group_idx] + X_train_aug[col2].values*beta2[group_idx] + error

In reality, I have 40+ columns, so I need to initialize 40+ betas. I know I can use the shape parameter to initialize all of my priors in a single line of code; however, I also need the dims parameter to tell my model that there is a hierarchy in the data. How can I write my model so that all the betas are initialized with the appropriate prior (Normal) and the correct hierarchy, dims='group'?

Right now I am doing something like this:

    beta1 = pm.Normal('beta1', mu_beta, sigma_beta, dims='group')
    beta2 = pm.Normal('beta2', mu_beta, sigma_beta, dims='group')
    beta3 = pm.Normal('beta3', mu_beta, sigma_beta, dims='group')
    beta4 = pm.Normal('beta4', mu_beta, sigma_beta, dims='group')
    beta5 = pm.Normal('beta5', mu_beta, sigma_beta, dims='group')
    ...
    beta45 = pm.Normal('beta45', mu_beta, sigma_beta, dims='group')

    y = alpha[group_idx] + X_train_aug['col1'].values*beta1[group_idx] + X_train_aug['col2'].values*beta2[group_idx] + X_train_aug['col3'].values*beta3[group_idx] + ... + X_train_aug['col45'].values*beta45[group_idx] + error

How can I achieve this efficiently?

I’d add a 'features' dimension to coords, then make a single beta with everything bundled together. Example code:

import pymc as pm
import numpy as np
import pandas as pd
from string import ascii_letters

n_obs = 100
n_features = 20
n_groups = 5

# toy data: one single-letter column name per feature
X = np.random.normal(size=(n_obs, n_features))
groups = np.random.randint(0, n_groups, size=n_obs)
df = pd.DataFrame(X, columns=list(ascii_letters[:n_features]))

coords = {
    'features': df.columns,
    'groups': np.arange(n_groups),
}

with pm.Model(coords=coords) as mod:
    X = pm.MutableData('X', df)
    alpha = pm.Normal('alpha', dims=['groups'])
    betas = pm.Normal('beta', dims=['groups', 'features'])

    # one row of coefficients per observation, summed across features
    mu = alpha[groups] + (betas[groups] * X).sum(axis=-1)

betas has shape (groups, features) and gets broadcast up to (obs, features) when you index it with groups, so (betas[groups] * X).sum(axis=-1) is the same as what you have, just compact.
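You can convince yourself of the equivalence with a quick NumPy check (a standalone sketch, separate from the model above; all sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n_obs, n_features, n_groups = 6, 3, 2

betas = rng.normal(size=(n_groups, n_features))
X = rng.normal(size=(n_obs, n_features))
groups = rng.integers(0, n_groups, size=n_obs)

# betas[groups] has shape (n_obs, n_features): row i holds the
# coefficients for observation i's group
compact = (betas[groups] * X).sum(axis=-1)

# the same thing written out column by column, like the manual formula
manual = sum(betas[groups, j] * X[:, j] for j in range(n_features))

assert np.allclose(compact, manual)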

For indexing X you have two choices:

  1. (my recommended option) Make a function that creates and returns your model, taking the dataframe and a list of columns as inputs. Use this list of columns to build the coords and to slice your data inside the pm.MutableData constructor, so you don’t have to slice X anywhere else downstream in the model (see the sketch after this list).

  2. If you really want to pass the entire dataframe (although you only use a subset of columns), you can slice by index number rather than column name. Make a list of the features you want in the regression, convert them to numerical indices into your dataframe, then slice betas and X accordingly. Example:

    feature_list_1 = list('afj')
    feature_list_2 = list('bcdehijkl')
    
    feature_idx_1 = [i for i, x in enumerate(df) if x in feature_list_1]
    feature_idx_2 = [i for i, x in enumerate(df) if x in feature_list_2]

    mu1 = alpha[groups] + (betas[groups, :][:, feature_idx_1] * X[:, feature_idx_1]).sum(axis=-1)
    mu2 = alpha[groups] + (betas[groups, :][:, feature_idx_2] * X[:, feature_idx_2]).sum(axis=-1)

If anyone knows how to avoid the awkward double-indexing on betas, let me know. If you try to put both indices in the same slice it triggers fancy indexing, which isn’t what you want in this case.
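And here is a minimal sketch of option 1 (the function name, arguments, and the fixed sigma of 1.0 are placeholders to show the shape of the idea, not a full model):

import numpy as np
import pandas as pd
import pymc as pm

def build_model(df, feature_cols, group_idx, y, n_groups):
    """Hierarchical regression over only the requested columns."""
    coords = {'features': feature_cols, 'groups': np.arange(n_groups)}
    with pm.Model(coords=coords) as model:
        # slice once, here, so nothing downstream needs column names
        X = pm.MutableData('X', df[feature_cols])
        g = pm.MutableData('group_idx', group_idx)
        y_data = pm.MutableData('y', y)

        alpha = pm.Normal('alpha', dims='groups')
        betas = pm.Normal('betas', dims=['groups', 'features'])
        mu = alpha[g] + (betas[g] * X).sum(axis=-1)
        pm.Normal('target', mu, 1.0, observed=y_data)
    return model

# one model per feature subset, no slicing inside the graph:
# mod1 = build_model(df, list('afj'), groups, y_obs, n_groups)
# mod2 = build_model(df, list('bcdehijkl'), groups, y_obs, n_groups)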
