Bayesian hierarchical Linear Regression (Partial Pooling) model using PyMC

Short answer: you should link the shape of Y_obs to the shape of one of the exogenous data variables, say Location, so PyTensor knows internally to update the shape of Y_obs when the shape of Location changes. This is explained in the linked example. You do it as follows:

Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=training_data['Data_Value'], shape=Location.shape[0])

Beyond that, some suggestions:

  1. You should use coords and dims, as shown in the example already linked. This will make your code significantly easier to reason about and read.
  2. Somewhat related, pd.factorize is the suggested way to handle label encoding for PyMC models. See the notes below.
  3. There’s no reason to split up your data. It should be much easier to keep in a matrix, and work with this directly.
  4. Your slopes seem to be wrong. You have a single mean and sigma hyperprior for all feature/location pairs. This encodes that the estimated effect for, say, obesity in location A, is drawn from the same distribution as the effect for the average temperature in region B. I suspect this is not what you intend.
  5. Code inside parentheses can include line breaks; there's no need to use the \ continuation notation.
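To illustrate point 2: with sort=True, pd.factorize assigns label 0 to the first class in sorted order rather than the first class encountered, so the integer codes mean the same thing across independently-factorized splits. A minimal sketch (the labels here are made up):

```python
import pandas as pd

# Two splits of the same categorical column; 'B' appears first in one, 'A' in the other
train_labels = pd.Series(['B', 'A', 'C', 'B'])
test_labels = pd.Series(['A', 'C', 'B'])

# With sort=True, label 0 is always 'A' in both splits
train_idx, train_uniques = pd.factorize(train_labels, sort=True)
test_idx, test_uniques = pd.factorize(test_labels, sort=True)

print(list(train_idx))      # [1, 0, 2, 1]
print(list(test_idx))       # [0, 2, 1]
print(list(train_uniques))  # ['A', 'B', 'C']
```

Without sort=True, 'B' would get code 0 in the train split but code 2 in the test split, silently mismatching the indices fed to the model.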

My suggested corrections:

import numpy as np
import pandas as pd
import pymc as pm

from string import ascii_uppercase
from sklearn.model_selection import train_test_split

# Generate random data
df = pd.DataFrame(np.random.normal(size=(500, 4)), columns=['obesity_Prevalence', 'data_value_py', 'Avg Temp(°F)', 'AQI'])
df['location'] = np.random.choice(list(ascii_uppercase), size=500, replace=True)

# Stratify by location so all locations are in both the train and test set
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.location.values)

# sort doesn't sort the rows of the data; it sorts the indices assigned to each location. When
# sort=False, label 0 belongs to the first class encountered. When True, it belongs to the first
# class in sorted order. It's important to sort here so that label 0 means the same thing in both
# the train and test data (although the location in the first row of data might differ in each)
location_idx_train, locations = pd.factorize(df_train.location, sort=True)
location_idx_test, _ = pd.factorize(df_test.location, sort=True)
features = ['obesity_Prevalence', 'Avg Temp(°F)', 'AQI']

coords = {
    'location': locations,
    'feature': features,
}

# Define the model
with pm.Model(coords=coords) as hierarchical_model:
    # Define the data
    X = pm.Data('X', df_train[features])
    y = pm.Data('y', df_train['data_value_py'])
    location_idx_pt = pm.Data('location_idx', location_idx_train)
    
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0, sigma=0.5)
    sigma_a = pm.HalfCauchy('sigma_a', beta=1)
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, dims=['location'])

    # Hyperpriors for group slopes
    mu_b = pm.Normal('mu_b', mu=0, sigma=1, dims=['feature'])
    sigma_b = pm.HalfCauchy('sigma_b', beta=1, dims=['feature'])
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, dims=['location', 'feature'])

    # Model error
    sigma = pm.HalfCauchy('sigma', beta=1)

    # Expected value
    mu = a[location_idx_pt] + (X * b[location_idx_pt]).sum(axis=-1)
    
    # Likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=y, shape=X.shape[0])
    idata = pm.sample(nuts_sampler='nutpie')
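An aside on the expected-value line: a[location_idx] picks out each row's intercept and b[location_idx] picks out each row's slope vector, so the elementwise product with X followed by a sum over the feature axis gives one fitted value per row. The same indexing and broadcasting in plain NumPy (shapes only, made-up numbers):

```python
import numpy as np

n_obs, n_locations, n_features = 6, 3, 2
rng = np.random.default_rng(0)

a = rng.normal(size=n_locations)                # one intercept per location
b = rng.normal(size=(n_locations, n_features))  # one slope per location/feature pair
X = rng.normal(size=(n_obs, n_features))
location_idx = rng.integers(0, n_locations, size=n_obs)

# Fancy indexing expands parameters to per-observation arrays:
# a[location_idx] has shape (n_obs,), b[location_idx] has shape (n_obs, n_features)
mu = a[location_idx] + (X * b[location_idx]).sum(axis=-1)
print(mu.shape)  # (6,)
```

This is why no explicit loop over locations is needed anywhere in the model.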

Out of sample prediction:

with hierarchical_model:
    pm.set_data({'X':df_test[features], 'location_idx':location_idx_test})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)