Hello PyMC3 Community,
I have been trying to learn how to perform train test splits with PyMC3 by following the example here. However, the model I am interested in has a multi-level hierarchical intercept term, so model construction is more complex and I can’t simply swap out the data inside the model using a data container.
I am especially interested in how to deal with new labels in my hierarchy that can be introduced in the test set. For example, I have made some toy data and a model with a multi-level hierarchical intercept that fits it well. The general idea of the model is that the region average informs the state average, which informs the county average, i.e.
region_average = Normal(mu=5, sigma=2.5)
state_average = region_average + state_level_noise
county_average = state_average + county_level_noise
y = county_average + observation_noise
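To make the generative story concrete, a single simulated draw in plain NumPy would look like this (a sketch only; the noise scales are illustrative choices of mine, not values from the model):

import numpy as np

region_average = np.random.normal(5.0, 2.5)                  # region-level average
state_average = region_average + np.random.normal(0.0, 2.0)  # add state-level noise
county_average = state_average + np.random.normal(0.0, 2.0)  # add county-level noise
y = county_average + np.random.normal(0.0, 0.1)              # add observation noise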
Here is a visual of the hierarchy, in case it helps.
The data consists of the features “region” of the United States, “state”, and “county” where the measurements “y” are taken. The only difference between the train and test sets is that in California a new county, “San Francisco”, is introduced, which will create issues in the shapes/indices of my intercept.
import arviz as az
import numpy as np
import pandas as pd
import pymc3 as pm
train_data = pd.DataFrame({
    "region": ["West"] * 30 + ["East"] * 10,
    "state": ["CA"] * 20 + ["NV"] * 10 + ["NY"] * 10,
    "county": ["Los Angeles"] * 10 + ["San Diego"] * 5 + ["Orange"] * 5
              + ["Clark"] * 5 + ["Washoe"] * 5 + ["Kings"] * 5 + ["Queens"] * 5,
    "y": list(2.0 + 0.1 * np.random.randn(10)) + list(2.5 + 0.1 * np.random.randn(5))
         + list(3.0 + 0.1 * np.random.randn(5)) + list(5.0 + 0.1 * np.random.randn(5))
         + list(5.5 + 0.1 * np.random.randn(5)) + list(8.0 + 0.1 * np.random.randn(5))
         + list(8.5 + 0.1 * np.random.randn(5)),
})
test_data = pd.DataFrame({
    "region": ["West"] * 25 + ["East"] * 10,
    "state": ["CA"] * 15 + ["NV"] * 10 + ["NY"] * 10,
    "county": ["San Francisco"] * 5 + ["San Diego"] * 5 + ["Orange"] * 5
              + ["Clark"] * 5 + ["Washoe"] * 5 + ["Kings"] * 5 + ["Queens"] * 5,
    "y": list(1.5 + 0.1 * np.random.randn(5)) + list(2.5 + 0.1 * np.random.randn(5))
         + list(3.0 + 0.1 * np.random.randn(5)) + list(5.0 + 0.1 * np.random.randn(5))
         + list(5.5 + 0.1 * np.random.randn(5)) + list(8.0 + 0.1 * np.random.randn(5))
         + list(8.5 + 0.1 * np.random.randn(5)),
})
The code for preprocessing the data and building the model is below:
train_data["state_region"] = train_data["state"] + "_" + train_data["region"]
train_data["state_region"] = train_data["state_region"].astype("category")
train_data["county_state_region"] = train_data["county"] + "_" + train_data["state"] + "_" + train_data["region"]
train_data["county_state_region"] = train_data["county_state_region"].astype("category")
train_data["region"] = train_data["region"].astype("category")
train_data["state"] = train_data["state"].astype("category")
train_data["county"] = train_data["county"].astype("category")
region_index_original = (
train_data[["region", "state", "state_region"]]
.drop_duplicates()["region"]
.cat.codes
)
state_index_original = (
train_data[[ "region", "state", "county", "county_state_region"]]
.drop_duplicates()["state"]
.cat.codes
)
county = train_data["county"].cat.codes
number_of_regions = train_data["region"].nunique()
number_of_states = train_data["state_region"].nunique()
number_of_counties = train_data["county_state_region"].nunique()
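To make the index mismatch concrete: applying the same categorical encoding to the test set produces different codes, because pandas sorts the categories alphabetically and “San Francisco” lands in the middle (the expected output for this toy data is in the comments):

print(list(train_data["county"].cat.categories))
# ['Clark', 'Kings', 'Los Angeles', 'Orange', 'Queens', 'San Diego', 'Washoe']
print(list(test_data["county"].astype("category").cat.categories))
# ['Clark', 'Kings', 'Orange', 'Queens', 'San Diego', 'San Francisco', 'Washoe']
# e.g. "Orange" is code 3 in training but code 2 in test, and "San Francisco"
# has no training code at all, so indexing a_county with test codes would be wrong.

With that caveat noted, here is the model built from the training indices: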
with pm.Model() as toy_model:
    # Hyper priors: region
    sigma_a_region = pm.HalfNormal("sigma_a_region", 1.0)
    mu_a_region = pm.Normal("mu_a_region", mu=5.0, sigma=2.5)
    a_region_offset = pm.Normal("a_region_offset", 0.0, 1.0, shape=number_of_regions)
    # Region coefficients (non-centered parameterization)
    a_region = pm.Deterministic("a_region", mu_a_region + sigma_a_region * a_region_offset)

    # Hyper priors: state
    sigma_a_state = pm.HalfNormal("sigma_a_state", 2.0)
    a_state_offset = pm.Normal("a_state_offset", 0.0, 1.0, shape=number_of_states)
    # State coefficients, centered on their parent region
    a_state = pm.Deterministic(
        "a_state", a_region[region_index_original.values] + sigma_a_state * a_state_offset
    )

    # Hyper priors: county
    sigma_a_county = pm.HalfNormal("sigma_a_county", 2.0)
    a_county_offset = pm.Normal("a_county_offset", 0.0, 1.0, shape=number_of_counties)
    # County coefficients, centered on their parent state
    a_county = pm.Deterministic(
        "a_county", a_state[state_index_original.values] + sigma_a_county * a_county_offset
    )

    # Model error
    epsilon = pm.HalfNormal("eps", 5.0)

    yield_estimate = a_county[county.values]  # + pm.math.dot(X[modeling_columns].values, beta)

    # Data likelihood
    yield_like = pm.Normal(
        "yield_likelihood",
        mu=yield_estimate,
        sigma=epsilon,
        observed=train_data.y,
    )
with toy_model:
    toy_idata = pm.sample(draws=1000, tune=500, target_accept=0.95, return_inferencedata=True)
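For reference, in-sample posterior predictive sampling works fine (a sketch, assuming a PyMC3 version whose sample_posterior_predictive accepts the InferenceData object directly):

with toy_model:
    ppc = pm.sample_posterior_predictive(toy_idata)

# One predictive draw of y per posterior sample for each training row
print(ppc["yield_likelihood"].shape)  # (n_posterior_samples, 40)

The question is how to do the same for the test set.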
The tricky part is that toy_idata depends on the indices used to create the model, which in turn depend on the training data. This leads me to believe that I will need to recreate the model on the test set. However, if the test set’s indices don’t match the indices of the training set, then my inference isn’t helpful.
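One idea I have been toying with (I am not sure it is the right approach) is to pin the test set’s categories to the training ordering, appending the unseen county at the end so the shared codes are preserved:

# Hypothetical sketch: reuse the training category order and append the new county
all_counties = list(train_data["county"].cat.categories) + ["San Francisco"]
test_data["county"] = pd.Categorical(test_data["county"], categories=all_counties)
test_data["county"].cat.codes.unique()
# e.g. array([7, 5, 3, 0, 6, 1, 4]) -- shared counties keep their training codes

But a_county only has 7 entries (codes 0 through 6), so code 7 has no coefficient; presumably the new county’s intercept should be drawn from the California state distribution, and that wiring is exactly what I don’t know how to do.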
Any input on how I can use this model in a train test split set up would be amazing!