Short answer: you should link the shape of `Y_obs` to the shape of the exogenous data, say `Location`, so PyTensor knows to update the shape of `Y_obs` whenever the shape of `Location` changes. This was explained in the linked example. You do it as follows:
```python
Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma,
                  observed=training_data['Data_Value'],
                  shape=Location.shape[0])
```
Beyond that, some suggestions:
- You should use `coords` and `dims`, as shown in the example already linked. This will make your code significantly easier to reason about and read.
- Somewhat related, `pd.factorize` is the suggested way to handle label encoding for PyMC models. See the notes below.
- There's no reason to split up your data. It is much easier to keep it in a matrix and work with that directly.
- Your slopes seem to be wrong. You have a single mean and sigma hyperprior for all feature/location pairs. This encodes that the estimated effect of, say, obesity in location A is drawn from the same distribution as the effect of average temperature in location B. I suspect this is not what you intend.
- Code inside parentheses can include line breaks; there's no need to use the `\` line-continuation notation.
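To illustrate why `sort=True` matters for label encoding, here's a minimal sketch with made-up location labels (not from your data):

```python
import pandas as pd

# Two splits that encounter the same locations in different orders
train_labels = pd.Series(['B', 'A', 'C', 'A'])
test_labels = pd.Series(['C', 'A', 'B'])

# With sort=True, code 0 always means 'A', 1 means 'B', 2 means 'C',
# regardless of which label happens to appear first in each split
train_idx, train_cats = pd.factorize(train_labels, sort=True)
test_idx, test_cats = pd.factorize(test_labels, sort=True)

print(list(train_cats))  # ['A', 'B', 'C']
print(train_idx)         # [1 0 2 0]
print(test_idx)          # [2 0 1]
```

Note that the codes only line up because every label appears in both splits; that's exactly what the stratified `train_test_split` below guarantees.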
My suggested corrections:
```python
import numpy as np
import pandas as pd
import pymc as pm
from string import ascii_uppercase
from sklearn.model_selection import train_test_split

# Generate random data
df = pd.DataFrame(np.random.normal(size=(500, 4)),
                  columns=['obesity_Prevalence', 'data_value_py', 'Avg Temp(°F)', 'AQI'])
df['location'] = np.random.choice(list(ascii_uppercase), size=500, replace=True)

# Stratify by location so all locations are in both the train and test set
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.location.values)

# `sort` doesn't sort the rows of the data, it sorts the indices being assigned to each
# location. When sort=False, label 0 belongs to the first class encountered. When True, it
# belongs to the sorted first class. It's important to sort here so that class 0 is the same
# thing in both the train and test data (although the location in the first row of data
# might be different in each).
location_idx_train, locations = pd.factorize(df_train.location, sort=True)
location_idx_test, _ = pd.factorize(df_test.location, sort=True)

features = ['obesity_Prevalence', 'Avg Temp(°F)', 'AQI']
coords = {
    'location': locations,
    'feature': features,
}

# Define the model
with pm.Model(coords=coords) as hierarchical_model:
    # Define the data
    X = pm.Data('X', df_train[features])
    y = pm.Data('y', df_train['data_value_py'])
    location_idx_pt = pm.Data('location_idx', location_idx_train)

    # Hyperpriors for group intercepts
    mu_a = pm.Normal('mu_a', mu=0, sigma=0.5)
    sigma_a = pm.HalfCauchy('sigma_a', beta=1)
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, dims=['location'])

    # Hyperpriors for group slopes
    mu_b = pm.Normal('mu_b', mu=0, sigma=1, dims=['feature'])
    sigma_b = pm.HalfCauchy('sigma_b', beta=1, dims=['feature'])
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, dims=['location', 'feature'])

    # Model error
    sigma = pm.HalfCauchy('sigma', beta=1)

    # Expected value
    mu = a[location_idx_pt] + (X * b[location_idx_pt]).sum(axis=-1)

    # Likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=y, size=X.shape[0])

    idata = pm.sample(nuts_sampler='nutpie')
```
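As a quick sanity check on the expected-value line, the indexing and broadcasting can be reproduced in plain NumPy (toy shapes, not your data):

```python
import numpy as np

n_obs, n_locations, n_features = 10, 3, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(n_obs, n_features))        # feature matrix
a = rng.normal(size=n_locations)                # per-location intercepts
b = rng.normal(size=(n_locations, n_features))  # per-(location, feature) slopes
location_idx = rng.integers(0, n_locations, size=n_obs)

# b[location_idx] has shape (n_obs, n_features): each row picks the slope
# vector for that observation's location. The elementwise product with X,
# summed over the feature axis, gives one linear predictor per observation.
mu = a[location_idx] + (X * b[location_idx]).sum(axis=-1)
print(mu.shape)  # (10,)
```

Each entry of `mu` equals the intercept for that row's location plus the dot product of that row's features with that location's slope vector.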
Out of sample prediction:
```python
with hierarchical_model:
    pm.set_data({'X': df_test[features], 'location_idx': location_idx_test})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)
```