Out of Sample Predictions on hierarchical regression model (PyMC - v5.0.1)

Hey everyone!
I have been trying to get out-of-sample predictions from my inference object, but I keep getting the error below whenever I use pm.sample_posterior_predictive with my model and trace.

size does not match the broadcast shape of the parameters. (1063,), (1063,), (4251,)

where 1063 is the number of rows in the test set for which I need samples, and 4251 is the number of rows in the training set. I used pm.MutableData("X", X_train) and pm.MutableData("Y", Y_train) while building the model, and pm.set_data({"X": X_test, "Y": Y_test}) to get predictions, as shown in this example.

I never got this error in PyMC3 using Theano shared variables; however, since moving to PyMC v5.0.1 I keep running into it.

Any kind of help would be much appreciated!

Do you use an index variable in the model, like mu = alpha[idx] + beta * feature_data? If so, you’ll have to make idx a mutable data object as well and pass in new indices for your test data, roughly like the sketch below.
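Here is a minimal sketch of that pattern (the data, sizes, and names like idx_train and feature_data are all placeholders, not your actual variables):

import numpy as np
import pymc as pm

# placeholder data: 3 groups, 40 train rows, 10 test rows
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=40), rng.normal(size=10)
idx_train = rng.integers(0, 3, size=40)
idx_test = rng.integers(0, 3, size=10)
y_train = rng.normal(size=40)

with pm.Model() as model:
    # the group index must be mutable too, otherwise it stays train-sized
    idx = pm.MutableData("idx", idx_train)
    feature_data = pm.MutableData("feature_data", X_train)
    y = pm.MutableData("y", y_train)

    alpha = pm.Normal("alpha", 0, 1, shape=3)
    beta = pm.Normal("beta", 0, 1)
    mu = alpha[idx] + beta * feature_data
    pm.Normal("obs", mu, 1, observed=y)

    trace = pm.sample(200, tune=200)

with model:
    # swap all three to test-sized arrays together so the shapes agree
    pm.set_data({"idx": idx_test, "feature_data": X_test, "y": np.zeros(10)})
    ppc = pm.sample_posterior_predictive(trace)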

I updated the code to make the index variable a mutable data object as well, and changed it before predicting, but the error is more or less the same.

with model:

    X = pm.MutableData("X", X_train_aug)
    Y = pm.MutableData("Y", Y_train)

    group_idx = pm.MutableData("group_idx", g_id)

    mu_alpha = pm.Normal('mu_alpha', mu=0, sigma=50)
    sigma_alpha = pm.HalfNormal('sigma_alpha', 15)
    mu_beta = pm.Normal('mu_beta', mu=1, sigma=10)
    sigma_beta = pm.HalfNormal('sigma_beta', 30)
    error = pm.Normal('e', mu=-10, sigma=20)

    beta1 = pm.Normal('beta1', mu_beta, sigma_beta, dims='group')
    beta2 = pm.Normal('beta2', mu_beta, sigma_beta, dims='group')
    alpha = pm.Normal('alpha', mu_alpha, sigma_alpha, dims='group')

    y = alpha[group_idx] + X_train_aug[col1].values*beta1[group_idx] + X_train_aug[col2].values*beta2[group_idx] + error
    target = pm.TruncatedNormal('target', mu=y, lower=0, upper=300, observed=Y)

    step = pm.NUTS(target_accept=0.95)
    trace = pm.sample(3000, init='ADVI+adapt_diag', return_inferencedata=True, tune=7000, step=step)

where g_id holds the training group indices, with length 4251 (one per training sample).

For predictions I am using the code below:

with model:
    pm.set_data({"X": X_test_aug, "Y": Y_test, "group_idx": t_id})
    pm.sample_posterior_predictive(trace, var_names = ['target'], return_inferencedata=True, predictions=True,  extend_inferencedata=True)

where t_id holds the test group indices, with length 1063 (one per test sample).

The error I am getting is:

Input dimension mismatch. One other input has shape[0] = 1063, but input[1].shape[0] = 4251

You are using X_train_aug, the underlying data, not X, the mutable object, in the y= line.

An easy check for problems like this is to always visualize your model with pm.model_to_graphviz. If you did, you would see that X is disconnected from the rest of the graph, which would prompt you to look for why it isn’t informing y, which leads you to the bug.
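For example, assuming your model object is in scope as model:

import pymc as pm

# render the model graph; a disconnected X node stands out immediately
graph = pm.model_to_graphviz(model)
graph.render("model_graph", format="png")  # or just display `graph` in a notebook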

I believe this will solve the problem. Thanks!

I have a few more syntax-related questions:

I used X_train_aug in the y = ... line because I need to use very specific columns for model building (see below):

y = alpha[group_idx] + X_train_aug[col1].values*beta1[group_idx] + X_train_aug[col2].values*beta2[group_idx] + error

In the above line of code, I used X_train_aug[col1].values and X_train_aug[col2].values to specify my formula (in reality there are 40+ columns that I write out manually). How can I do this with my mutable data object X? I can’t seem to slice the mutable data X by column names.

The example above uses two parameters (beta1 & beta2) to specify the formula:

    beta1 = pm.Normal('beta1', mu_beta, sigma_beta, dims='group')
    beta2 = pm.Normal('beta2', mu_beta, sigma_beta, dims='group')
    alpha = pm.Normal('alpha', mu_alpha, sigma_alpha, dims='group')

    y = alpha[group_idx] + X_train_aug[col1].values*beta1[group_idx] + X_train_aug[col2].values*beta2[group_idx] + error

In reality, I have 40+ columns, so I need to initialize 40+ betas. I know I can use the shape parameter to initialize all of my priors in a single line of code; however, I also need the dims parameter to tell my model that there is a hierarchy in the data. How can I write my model so that all the betas are initialized with the appropriate prior (Normal) and the correct hierarchy, dims='group'?

Right now I am doing something like this:

    beta1 = pm.Normal('beta1', mu_beta, sigma_beta, dims='group')
    beta2 = pm.Normal('beta2', mu_beta, sigma_beta, dims='group')
    beta3 = pm.Normal('beta3', mu_beta, sigma_beta, dims='group')
    beta4 = pm.Normal('beta4', mu_beta, sigma_beta, dims='group')
    beta5 = pm.Normal('beta5', mu_beta, sigma_beta, dims='group')
    ...
    beta45 = pm.Normal('beta45', mu_beta, sigma_beta, dims='group')

    y = alpha[group_idx] + X_train_aug['col1'].values*beta1[group_idx] + X_train_aug['col2'].values*beta2[group_idx] + X_train_aug['col3'].values*beta3[group_idx] + ... + X_train_aug['col45'].values*beta45[group_idx] + error

How can I achieve this efficiently?

I’d add a 'features' dimension to coords, then make a single beta with everything bundled together. Example code:

import pymc as pm
import numpy as np
import pandas as pd
from string import ascii_letters

n_obs = 100
n_features = 20
n_groups = 5

# toy data: one single-letter column name per feature
X = np.random.normal(size=(n_obs, n_features))
groups = np.random.randint(0, n_groups, size=n_obs)
df = pd.DataFrame(X, columns=list(ascii_letters[:n_features]))

coords = {
    'features': df.columns,
    'groups': np.arange(n_groups),
}

with pm.Model(coords=coords) as mod:
    X = pm.MutableData('X', df)
    alpha = pm.Normal('alpha', dims=['groups'])
    betas = pm.Normal('beta', dims=['groups', 'features'])

    # one row of coefficients per observation, summed across features
    mu = alpha[groups] + (betas[groups] * X).sum(axis=-1)

betas has shape (groups, features) and gets broadcast up to (obs, features) when you index it with groups, so (betas[groups] * X).sum(axis=-1) is the same as what you have, just compact.
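You can convince yourself of the equivalence with a quick NumPy check (a standalone sketch, separate from the model above; all sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n_obs, n_features, n_groups = 6, 3, 2

betas = rng.normal(size=(n_groups, n_features))
X = rng.normal(size=(n_obs, n_features))
groups = rng.integers(0, n_groups, size=n_obs)

# betas[groups] has shape (n_obs, n_features): row i holds the
# coefficients for observation i's group
compact = (betas[groups] * X).sum(axis=-1)

# the same thing written out column by column, like the manual formula
manual = sum(betas[groups, j] * X[:, j] for j in range(n_features))

assert np.allclose(compact, manual)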

For indexing X you have two choices:

  1. (my recommended option) Make a function that creates and returns your model, taking the dataframe and a list of columns as inputs. Use this list of columns to build the coords and to slice your data inside the pm.MutableData constructor, so you don’t have to slice X anywhere else downstream in the model (see the sketch after this list).

  2. If you really want to pass the entire dataframe (although you only use a subset of columns), you can slice by index number rather than column name. Make a list of the features you want in the regression, convert them to numerical indices into your dataframe, then slice betas and X accordingly. Example:

    feature_list_1 = list('afj')
    feature_list_2 = list('bcdehijkl')
    
    feature_idx_1 = [i for i, x in enumerate(df) if x in feature_list_1]
    feature_idx_2 = [i for i, x in enumerate(df) if x in feature_list_2]

    mu1 = alpha[groups] + (betas[groups, :][:, feature_idx_1] * X[:, feature_idx_1]).sum(axis=-1)
    mu2 = alpha[groups] + (betas[groups, :][:, feature_idx_2] * X[:, feature_idx_2]).sum(axis=-1)

If anyone knows how to avoid the awkward double-indexing on betas, let me know. If you try to put both indices in the same slice it triggers fancy indexing, which isn’t what you want in this case.
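And here is a minimal sketch of option 1 (the function name, arguments, and the fixed sigma of 1.0 are placeholders to show the shape of the idea, not a full model):

import numpy as np
import pandas as pd
import pymc as pm

def build_model(df, feature_cols, group_idx, y, n_groups):
    """Hierarchical regression over only the requested columns."""
    coords = {'features': feature_cols, 'groups': np.arange(n_groups)}
    with pm.Model(coords=coords) as model:
        # slice once, here, so nothing downstream needs column names
        X = pm.MutableData('X', df[feature_cols])
        g = pm.MutableData('group_idx', group_idx)
        y_data = pm.MutableData('y', y)

        alpha = pm.Normal('alpha', dims='groups')
        betas = pm.Normal('betas', dims=['groups', 'features'])
        mu = alpha[g] + (betas[g] * X).sum(axis=-1)
        pm.Normal('target', mu, 1.0, observed=y_data)
    return model

# one model per feature subset, no slicing inside the graph:
# mod1 = build_model(df, list('afj'), groups, y_obs, n_groups)
# mod2 = build_model(df, list('bcdehijkl'), groups, y_obs, n_groups)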
