Train Test Splits with Multi-level hierarchical regression model where the test set contains unseen values in the hierarchy

Hello PyMC3 Community,

I have been trying to learn how to perform train test splits with PyMC3 by following the example here. However, the model I am interested has a multi-level hierarchical intercept term so model construction is more complex and thus I can’t simply swap out the data inside the model using a data container.

I am especially interested in how to deal with new labels for my hierarchy that can be introduced in the test set. For example, I have made some toy data and a multi-level hierarchical intercept term that fits the data well. The general idea of the model is that the region average informs the state average which informs the county average, i.e.

reggion_average = Normal(mu=5, sigma=2.5)
epsilon_state = region_average + state_level_noise
epsilon_county = state_average + county_level_noise
y = county_level_average + epsilon_county

and a visual in case it helps

The data consists of the features “region” of the united states, “state” and “county” where the measurements “y” are taken. The only difference between the train and test set is that in California, there is a new county introduced, “San Francisco”, which will create issues in the shapes/indices of my intercept.

import arviz as az
import numpy as np
import pandas as pd
import pymc3 as pm
train_data = pd.DataFrame({"region":["West"]*30 + ["East"]*10, "state":["CA"]*20 + ["NV"]*10 + ["NY"]*10,
               "county":["Los Angeles"]*10 + ["San Diego"]*5 + ["Orange"]*5 + ["Clark"]*5 + ["Washoe"]*5 + ["Kings"]*5 + ["Queens"]*5,
               "y": list(2+ 0.1*np.random.randn(10)) + list(2.5 + 0.1*np.random.randn(5)) + list(3. + 0.1*np.random.randn(5)) + list(5. + 0.1*np.random.randn(5)) + list(5.5 + 0.1*np.random.randn(5)) + list(8. + 0.1*np.random.randn(5)) + list(8.5 + 0.1*np.random.randn(5))})

test_data = pd.DataFrame({"region":["West"]*25 + ["East"]*10, "state":["CA"]*15 + ["NV"]*10 + ["NY"]*10,
               "county":["San Francisco"]*5 + ["San Diego"]*5 + ["Orange"]*5 + ["Clark"]*5 + ["Washoe"]*5 + ["Kings"]*5 + ["Queens"]*5,
               "y": list(1.5+ 0.1*np.random.randn(5)) + list(2.5 + 0.1*np.random.randn(5)) + list(3. + 0.1*np.random.randn(5)) + list(5. + 0.1*np.random.randn(5)) + list(5.5 + 0.1*np.random.randn(5)) + list(8. + 0.1*np.random.randn(5)) + list(8.5 + 0.1*np.random.randn(5))})

The code for preprocessing the data and the model are below:

train_data["state_region"] = train_data["state"] + "_" + train_data["region"]
train_data["state_region"] = train_data["state_region"].astype("category")
train_data["county_state_region"] = train_data["county"] + "_" + train_data["state"] + "_" + train_data["region"]
train_data["county_state_region"] = train_data["county_state_region"].astype("category")

train_data["region"] = train_data["region"].astype("category")
train_data["state"] = train_data["state"].astype("category")
train_data["county"] = train_data["county"].astype("category")

region_index_original = (
    train_data[["region", "state", "state_region"]]
    .drop_duplicates()["region"]
    .cat.codes
)

state_index_original = (
    train_data[[ "region", "state", "county", "county_state_region"]]
    .drop_duplicates()["state"]
    .cat.codes
)

county = train_data["county"].cat.codes

number_of_regions = train_data["region"].nunique()
number_of_states = train_data["state_region"].nunique()
number_of_counties = train_data["county_state_region"].nunique()
with pm.Model() as toy_model:

    # Hyper Priors: Region
    sigma_a_region = pm.HalfNormal("sigma_a_region", 1.)
    mu_a_region = pm.Normal("mu_a_region", mu = 5., sigma = 2.5,)
    a_region_offset = pm.Normal("a_region_offset", 0., 1., shape = number_of_regions)

    # Country coefficients
    a_region = pm.Deterministic("a_region", mu_a_region + sigma_a_region*a_region_offset)

    # Hyper Priors: State
    sigma_a_state = pm.HalfNormal("sigma_a_state", 2.)
    a_state_offset = pm.Normal("a_state_offset", 0., 1., shape = number_of_states)

    # State coefficients
    a_state = pm.Deterministic("a_state", a_region[region_index_original.values] + sigma_a_state*a_state_offset)    

    # Hyper Priors: County
    sigma_a_county = pm.HalfNormal("sigma_a_county", 2.)
    a_county_offset = pm.Normal("a_county_offset", 0., 1., shape = number_of_counties)

    # County coefficients
    a_county = pm.Deterministic("a_county", a_state[state_index_original.values] + sigma_a_county*a_county_offset)

    # Model error
    epsilon = pm.HalfNormal("eps", 5.)

    yield_estimate = (a_county[county.values]) # + pm.math.dot(X[modeling_columns].values, beta) )

    # Data likelihood
    yield_like = pm.Normal("yield_likelihood", 
                           mu=yield_estimate, 
                           sigma=epsilon, 
                           observed = train_data.y,
                           )
with toy_model:
    toy_idata = pm.sample(draws=1000, return_inferencedata=True, tune=500, target_accept=0.95)

The tricky part is that toy_idata depends on the indices use to create the model which depends on the data in my training set. This leads me to believe that I will need to recreate the model on the test set data. However, if the test set’s indices don’t match the indices of the training set, then my inference isn’t helpful.

Any input on how I can use this model in a train test split set up would be amazing!

So after reading this post, I was determined to experiment with model factory. The method that I believe will work is to add new values to the train set and test set that includes region, state, and county combinations that the other is missing but with np.nan y values.

That being said, I can’t post this as a solution as I have discovered that the base model has an issue. The a_state and a_region coefficients make no sense. I believe they should be centered around the average value for observations in their state and region respectively, but they are not. When I discover what’s wrong and have the missing values figured out, I’ll post an update.