Multi-level model with nonsense state and county coefficients

Hello PyMC3 Community,

I’m new to the Bayesian/PyMC3 community, and I have a question about how to make my model make more sense. In my model, I want an intercept term that’s informed by the county each observation is in, which in turn should be informed by the state that county is in. However, when my model finishes running I end up with lots of coefficients that don’t make sense, e.g., a coefficient for a county paired with a state it doesn’t belong to. Which brings me to my question: is there a way to tell my model not to create coefficients for state/county combinations that aren’t present in my data?

I’m including my code for the model below, along with a screenshot of the intercept term of my model.

with pm.Model(coords=coords) as hierarchical_model_varying_intercept_county_state:
    
    # Hyper Priors: State
    sigma_a_state = pm.HalfNormal("sigma_a_state", 3.)
    mu_a_state = pm.Normal("mu_a_state",mu=9., sigma = 2.5)
    
    # State coefficients
    a_state_offset = pm.Normal("a_state_offset", mu = 0., sigma = 1., dims="state")
    a_state = pm.Deterministic("a_state", mu_a_state + sigma_a_state*a_state_offset)
    
    # Hyper Priors: County
    sigma_a_county = pm.HalfNormal("sigma_a_county", 3.)
    mu_a_county = pm.Normal("mu_a_county",mu = 0., sigma = 2.5)    

    # County coefficients
    a_county_offset = pm.Normal("a_county_offset", mu = 0., sigma = 1., dims= ("county", "state"))
    a_county = pm.Deterministic("a_county", mu_a_county + a_state + sigma_a_county*a_county_offset)
    
    # Model error
    epsilon = pm.HalfCauchy("eps", 5.0)

    yield_estimate = (
                 a_county[county_idx, state_idx]
                )

    # Data likelihood
    disease_like = pm.TruncatedNormal("disease_likelihood", 
                           mu=yield_estimate, 
                           sigma=epsilon, 
                           observed = data.disease.values, 
                           dims="obs_id", lower = 0.)

I think the issue is with dims=("county", "state"). Try to reframe the model so that this is dims="county", and yield_estimate becomes

yield_estimate = (
                 a_county[county_idx]
                )
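In case it helps, here is a minimal sketch (with made-up data, and a hypothetical state_of_county lookup that is not in your code) of how you could build a per-county state index with pandas, so that a_state can be indexed once per county rather than once per observation:

```python
import pandas as pd

# Toy data: each row is one observation
data = pd.DataFrame({"state": ["CA", "NV", "CA"],
                     "county": ["Los Angeles", "Clark", "San Diego"]})

county_idx, counties = pd.factorize(data["county"], sort=True)

# Hypothetical lookup: the state each county belongs to, one entry per county,
# in the same order as `counties`
county_to_state = (data.drop_duplicates("county")
                       .set_index("county")["state"]
                       .loc[counties])
state_of_county, _ = pd.factorize(county_to_state, sort=True)
```

With a lookup like this, a_state[state_of_county] has one entry per county, so a county-level intercept never mixes in a state the county doesn’t belong to.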

Hi @DanWeitzenfeld

Sorry I took so long to get back to you. I’m brand new to this community and I triggered the spam bot with my first post.

I tried to work with your suggestion, but I seem to have made the problem worse. I think the issue is that I don’t understand how the dims from coords and the index variables county_idx, state_idx, and obs_id work together. In the following code, I’ve managed to remove “state” from dims in the definition of a_county_offset, but now when I run:

pm.model_to_graphviz(hierarchical_model_varying_intercept_county_state)

I can see that my model has built a coefficient for every observation. To make the discussion easier, I’ve included some sample data, which will hopefully make my code easier to follow.

import arviz as az
import numpy as np
import pandas as pd
import pymc3 as pm

sample_data = pd.DataFrame({"state":["CA", "NV", "CA", "NV", "CA", "CA", "NV", "CA", "NV", "CA", "CA", "NV", "CA", "NV", "CA"],
               "county":["Los Angeles", "Clark", "San Diego", "Washoe", "Orange", "Los Angeles", "Clark", "San Diego", "Washoe", "Orange", "Los Angeles", "Clark", "San Diego", "Washoe", "Orange"],
               "disease_count": [7., 8., 9., 10., 11., 6., 7., 8., 9., 10., 5., 6., 7., 8., 9.]})

county, counties = pd.factorize(sample_data["county"], sort = True)
state, states = pd.factorize(sample_data["state"], sort = True)

coords = {
    "state" : states,
    "county" : counties,
    "obs_id" : np.arange(sample_data.shape[0])
}

with pm.Model(coords=coords) as hierarchical_model_varying_intercept_county_state:

    state_idx = pm.Data("state_idx", state, dims="obs_id")
    county_idx = pm.Data("county_idx", county, dims="obs_id")

    # Hyper Priors: State
    sigma_a_state = pm.HalfNormal("sigma_a_state", 3.)
    mu_a_state = pm.Normal("mu_a_state", mu=9., sigma=2.5)

    # State coefficients
    a_state_offset = pm.Normal("a_state_offset", mu=0., sigma=1., dims="state")
    a_state = pm.Deterministic("a_state", mu_a_state + sigma_a_state*a_state_offset)

    # Hyper Priors: County
    sigma_a_county = pm.HalfNormal("sigma_a_county", 3.)
    mu_a_county = pm.Normal("mu_a_county", mu=0., sigma=2.5)

    # County coefficients
    a_county_offset = pm.Normal("a_county_offset", mu=0., sigma=1., dims="county")
    a_county = pm.Deterministic("a_county", a_state[state_idx] + mu_a_county + sigma_a_county*a_county_offset[county_idx])

    # Model error
    epsilon = pm.HalfCauchy("eps", 5.0)

    disease_estimate = (
                  a_county[county_idx]
                )

    # Data likelihood
    disease_like = pm.TruncatedNormal("disease_likelihood",
                           mu=disease_estimate,
                           sigma=epsilon,
                           observed=sample_data.disease_count.values,
                           dims="obs_id", lower=0.)

I’m not yet familiar with pm.Data usage, so I got rid of it and just used the idx arrays generated by pandas factorize.
Also, I think mu_a_county is one intercept too many (a_state already carries the overall mean through mu_a_state, so the two means aren't separately identifiable), so I got rid of that as well.
I don’t have graphviz installed so I can’t check that, but this seems to work:

with pm.Model(coords=coords) as hierarchical_model_varying_intercept_county_state:

    # Hyper Priors: State
    sigma_a_state = pm.HalfNormal("sigma_a_state", 3.)
    mu_a_state = pm.Normal("mu_a_state", mu=9., sigma = 2.5)

    # State coefficients
    a_state_offset = pm.Normal("a_state_offset", mu = 0., sigma = 1., dims="state")
    a_state = pm.Deterministic("a_state", mu_a_state + sigma_a_state*a_state_offset) 

    # Hyper Priors: County
    sigma_a_county = pm.HalfNormal("sigma_a_county", 3.)
    
    # County coefficients
    a_county_offset = pm.Normal("a_county_offset", mu = 0., sigma = 1., dims= "county")
    a_county = pm.Deterministic("a_county", a_state[state] +  sigma_a_county*a_county_offset[county])

    # Model error
    epsilon = pm.HalfCauchy("eps", 5.0)

    # Data likelihood
    disease_like = pm.Normal("disease_likelihood", 
                           mu=a_county, 
                           sigma=epsilon, 
                           observed = sample_data.disease_count.values)

edit: Added code that generated figures.

I’m open to the possibility that I’m interpreting these results wrong or that my expected number of coefficients is incorrect, but I would think there should only be a total of 5 intercept terms (one for each county). However, my analysis of the model shows that there are as many coefficients as there are observations.

pm.model_to_graphviz(hierarchical_model_varying_intercept_county_state)

The only part I’m confident looks correct is the state coefficients, since there should only be two.

with hierarchical_model_varying_intercept_county_state:
    hierarchical_model_varying_intercept_county_state_idata = pm.sample(draws=500, return_inferencedata=True, tune=500, target_accept=0.95)

az.plot_forest(
     hierarchical_model_varying_intercept_county_state_idata, var_names="a_state", figsize=(12, 2), r_hat=True, combined=True, textsize=16
);

but when I pull up the county coefficients, you can see there are 15.

az.plot_forest(
     hierarchical_model_varying_intercept_county_state_idata, var_names="a_county", figsize=(12, 6), r_hat=True, combined=True, textsize=16
);

try:

az.plot_forest(
     hierarchical_model_varying_intercept_county_state_idata, var_names="a_county_offset", figsize=(12, 6), r_hat=True, combined=True, textsize=16
);

a_county is expanded to the length of the observations; the shape of a_county_offset is the number of counties.
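The shape difference is easy to see with plain NumPy (toy shapes only):

```python
import numpy as np

a_county_offset = np.zeros(5)            # one entry per county
county = np.array([0, 1, 2, 3, 4] * 3)   # county index for 15 observations

# Fancy indexing expands the 5 county values to the length of the observations
a_county = a_county_offset[county]

assert a_county_offset.shape == (5,)
assert a_county.shape == (15,)
```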

Ahhhh ok. So I was looking in the wrong spot. That’s such a relief to finally have this figured out. Thank you!!! Small follow up question. I’m curious what you use to look at the coefficients of your models?

I ask because I would rather see the distribution of

a_state[state] +  sigma_a_county*a_county_offset[county]

for a given state/county, or for all of the states/counties, than the distribution of a_county_offset alone.
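One way to get at that per-county distribution is to combine the posterior draws yourself. A minimal NumPy sketch with toy draw arrays (in practice these would come from the posterior in the InferenceData, and state_of_county is a hypothetical county-to-state lookup you would build from the data):

```python
import numpy as np

# Toy posterior draws: 4 draws, 2 states, 5 counties
a_state = np.array([[1.0, 2.0]] * 4)        # shape (draws, states)
sigma_a_county = np.full(4, 0.5)            # shape (draws,)
a_county_offset = np.zeros((4, 5))          # shape (draws, counties)

# Hypothetical lookup: the state index of each county
state_of_county = np.array([1, 0, 0, 0, 1])

# Per-county intercept draws: one column per county, not per observation
a_county_draws = a_state[:, state_of_county] + sigma_a_county[:, None] * a_county_offset
assert a_county_draws.shape == (4, 5)
```

Alternatively, you could wrap the same expression in a pm.Deterministic with dims="county" inside the model, so it gets tracked and plotted by ArviZ automatically.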

@DanWeitzenfeld I have a way to make the model more efficient. I’m updating my toy sample so that it’s easier to follow how the model is made:

import arviz as az
import numpy as np
import pandas as pd
import pymc3 as pm

toy_data = pd.DataFrame({"region":["West"]*25 + ["East"]*10, "state":["CA"]*15 + ["NV"]*10 + ["NY"]*10,
               "county":["Los Angeles"]*10 + ["San Diego"]*3 + ["Orange"]*2 + ["Clark"]*5 + ["Washoe"]*5 + ["Kings"]*5 + ["Queens"]*5,
               "y": list(2+ 0.1*np.random.randn(10)) + list(2.5 + 0.1*np.random.randn(3)) + list(3. + 0.1*np.random.randn(2)) + list(5. + 0.1*np.random.randn(5)) + list(5.5 + 0.1*np.random.randn(5)) + list(8. + 0.1*np.random.randn(5)) + list(8.5 + 0.1*np.random.randn(5))})

toy_data["state_region"] = toy_data["state"] + "_" + toy_data["region"]
toy_data["state_region"] = toy_data["state_region"].astype("category")
toy_data["county_state_region"] = toy_data["county"] + "_" + toy_data["state"] + "_" + toy_data["region"]
toy_data["county_state_region"] = toy_data["county_state_region"].astype("category")


toy_data["region"] = toy_data["region"].astype("category")
toy_data["state"] = toy_data["state"].astype("category")
toy_data["county"] = toy_data["county"].astype("category")

region_index = (
    toy_data[["region", "state", "state_region"]]
    .drop_duplicates()["region"]
    .cat.codes
)

state_index = (
    toy_data[[ "region", "state", "county", "county_state_region"]]
    .drop_duplicates()["state"]
    .cat.codes
)
county = toy_data["county"].cat.codes

number_of_regions = toy_data["region"].nunique()
number_of_states = toy_data["state"].nunique()
number_of_counties = toy_data["county"].nunique()

with pm.Model() as toy_model:

    # Hyper Priors: Region
    sigma_a_region = pm.HalfNormal("sigma_a_region", 1.)
    mu_a_region = pm.Normal("mu_a_region", mu = 5., sigma = 2.5,)
    a_region_offset = pm.Normal("a_region_offset", 0., 1., shape = number_of_regions)

    # Region coefficients
    a_region = pm.Deterministic("a_region", mu_a_region + sigma_a_region*a_region_offset)

    # Hyper Priors: State
    sigma_a_state = pm.HalfNormal("sigma_a_state", 2.)
    a_state_offset = pm.Normal("a_state_offset", 0., 1., shape = number_of_states)

    # State coefficients
    a_state = pm.Deterministic("a_state", a_region[region_index.values] + sigma_a_state*a_state_offset)    

    # Hyper Priors: County
    sigma_a_county = pm.HalfNormal("sigma_a_county", 2.)
    a_county_offset = pm.Normal("a_county_offset", 0., 1., shape = number_of_counties)

    # County coefficients
    a_county = pm.Deterministic("a_county", a_state[state_index.values] + sigma_a_county*a_county_offset)

    # Model error
    epsilon = pm.HalfNormal("eps", 5.)

    yield_estimate = (a_county[county.values]) # + pm.math.dot(X[modeling_columns].values, beta) )

    # Data likelihood
    yield_like = pm.Normal("yield_likelihood", 
                           mu=yield_estimate, 
                           sigma=epsilon, 
                           observed = toy_data.y,
                           )

Using this version of the model, I can more easily verify that the coefficients a_state and a_county make sense. I also added the a_region just to show how to expand it beyond state and county.
