Modeling interactions between categorical variables using coords and dims?

Kev · June 14, 2024, 4:46pm

What is the best/cleanest way to model the interactions between categorical variables while using PyMC’s shape functionalities (e.g. coords, dims, data containers)?

The example I’m working on is an unpooled negative binomial model where type and region are categorical variables: nv ~ type + region + region:type + enrollment_offset

I currently achieve this using the following code where in I create a new type_region field, but I feel as though there must be a cleaner way of doing this by referencing the type & region dims without creating additional columns.

df_crime = pd.read_csv('https://raw.githubusercontent.com/proback/BeyondMLR/master/data/c_data.csv',
    usecols=['nv', 'type', 'region', 'enroll1000'],
    dtype={'type':'category', 'region':'category'}                      
)

# I want a better way than this following line of code!
sr_type_region = pd.Categorical(df_crime.type.astype(str) +'_'+ df_crime.region.astype(str))

coords={
    'type': df_crime.type.cat.categories,
    'region': df_crime.region.cat.categories,
    'type_region':sr_type_region.categories,  # How do I replace this?
}

with pm.Model(coords=coords) as model_05: #negative binomial model

    # Data containers
    d_type = pm.Data('type', df_crime.type.cat.codes)
    d_region = pm.Data('region', df_crime.region.cat.codes)
    d_enrollment_offset = pm.Data('enrollment_offset', df_crime.enroll1000.values)
    d_crime = pm.Data('crime', df_crime.nv.values)

    # Coefs Priors
    b0 = pm.Normal('b0', mu=0, sigma=10)
    b_type = pm.Normal('b_type', mu=0, sigma=10, dims='type')
    b_region = pm.Normal('b_region', mu=0, sigma=10, dims='region')
    b_type_region = pm.Normal('b_type_region', mu=0, sigma=10,
        dims='type_region') # would like this to look more like dims=('type', 'region')

    # Systematic component
    y_sys = (
        b0
        + b_type[df_crime.type.cat.codes] * d_type
        + b_region[df_crime.region.cat.codes] * d_region
        + b_type_region[sr_type_region.codes] * sr_type_region.codes  # Is there a better way?
        + pm.math.log(d_enrollment_offset) # offset to weight crime counts by enrolments
    )

    # Inverse Link Function
    y_mu = pm.math.exp(y_sys)

    # Prior
    y_alpha = pm.Exponential('y_alpha', 0.5)

    # Random component
    y = pm.NegativeBinomial('y', mu=y_mu, alpha=y_alpha, observed=d_crime)

Is there a better way to index b_type_region? What is the best practice for setting the shape/dimensionality of interactions between categoricals in PyMC?

I’m fairly new to PyMC; Thanks for the help.

aseyboldt · June 17, 2024, 12:54pm

Hello,

The code you commented out as “would like this to look more like” should just work

b_type_region = pm.Normal('b_type_region', mu=0, sigma=10, dims=('type', 'region'))
# and then
(
    ...
    + b_type_region[sr_type.cat.codes, sr_region.cat.codes]
)

I’d try to avoid calling things “type” by the way. Everything should work, but you can’t for instance do trace.posterior.type because that is a keyword. Also, I’ve been bitten more than once by .cat.codes containing values of -1 for missing values. Because of that I’d always add an assert that -1 isn’t in that…

Kev · June 17, 2024, 3:22pm

Thank you. Your pointers were very helpful to me! It took me a while to understand, but its exactly what I was looking for. Thanks!

Topic		Replies	Views
Modelling interactions between two categorical variables Questions	2	966	March 15, 2020
Regression model using only categorical variables Questions	9	5911	October 26, 2021
Understanding coords, indexation, Data, ..., for multilevel models v5 modeling	1	3816	April 29, 2022
Systematic introduction of coords, dims and shapes version agnostic	3	73	November 3, 2024
Why my multinomial model with categorical predictors and response differs so much from results in R? version agnostic modeling	5	97	January 26, 2025

Modeling interactions between categorical variables using coords and dims?

Related topics