Modeling interactions between categorical variables using coords and dims?

What is the best/cleanest way to model the interactions between categorical variables while using PyMC’s shape functionalities (e.g. coords, dims, data containers)?

The example I’m working on is an unpooled negative binomial model where type and region are categorical variables: nv ~ type + region + region:type + enrollment_offset

I currently achieve this using the following code where in I create a new type_region field, but I feel as though there must be a cleaner way of doing this by referencing the type & region dims without creating additional columns.

df_crime = pd.read_csv('https://raw.githubusercontent.com/proback/BeyondMLR/master/data/c_data.csv',
    usecols=['nv', 'type', 'region', 'enroll1000'],
    dtype={'type':'category', 'region':'category'}                      
)

# I want a better way than this following line of code!
sr_type_region = pd.Categorical(df_crime.type.astype(str) +'_'+ df_crime.region.astype(str))

coords={
    'type': df_crime.type.cat.categories,
    'region': df_crime.region.cat.categories,
    'type_region':sr_type_region.categories,  # How do I replace this?
}

with pm.Model(coords=coords) as model_05: #negative binomial model

    # Data containers
    d_type = pm.Data('type', df_crime.type.cat.codes)
    d_region = pm.Data('region', df_crime.region.cat.codes)
    d_enrollment_offset = pm.Data('enrollment_offset', df_crime.enroll1000.values)
    d_crime = pm.Data('crime', df_crime.nv.values)

    # Coefs Priors
    b0 = pm.Normal('b0', mu=0, sigma=10)
    b_type = pm.Normal('b_type', mu=0, sigma=10, dims='type')
    b_region = pm.Normal('b_region', mu=0, sigma=10, dims='region')
    b_type_region = pm.Normal('b_type_region', mu=0, sigma=10,
        dims='type_region') # would like this to look more like dims=('type', 'region')

    # Systematic component
    y_sys = (
        b0
        + b_type[df_crime.type.cat.codes] * d_type
        + b_region[df_crime.region.cat.codes] * d_region
        + b_type_region[sr_type_region.codes] * sr_type_region.codes  # Is there a better way?
        + pm.math.log(d_enrollment_offset) # offset to weight crime counts by enrolments
    )

    # Inverse Link Function
    y_mu = pm.math.exp(y_sys)

    # Prior
    y_alpha = pm.Exponential('y_alpha', 0.5)

    # Random component
    y = pm.NegativeBinomial('y', mu=y_mu, alpha=y_alpha, observed=d_crime)

Is there a better way to index b_type_region? What is the best practice for setting the shape/dimensionality of interactions between categoricals in PyMC?

I’m fairly new to PyMC; Thanks for the help.

Hello,

The code you commented out as “would like this to look more like” should just work :wink:

b_type_region = pm.Normal('b_type_region', mu=0, sigma=10, dims=('type', 'region'))
# and then
(
    ...
    + b_type_region[sr_type.cat.codes, sr_region.cat.codes]
)

I’d try to avoid calling things “type” by the way. Everything should work, but you can’t for instance do trace.posterior.type because that is a keyword. Also, I’ve been bitten more than once by .cat.codes containing values of -1 for missing values. Because of that I’d always add an assert that -1 isn’t in that…

2 Likes

Thank you. Your pointers were very helpful to me! It took me a while to understand, but its exactly what I was looking for. Thanks!