Help figuring out simple categorical indexing

Hi all, and apologies if it’s a simple question. I’ve tried googling and for some reason still can’t wrap my head around it…

I’m trying to create a function that generated models which could have differing number of categorical features.

The way the model is defined (abbreviated) is the following:

def build_model(df, cat_cols=['c1', ...]):
  coords = {c: df[c].cat.categories, for c in cat_cols}
  with pm.Model(coords=coords) as m1:
      α = pm.Normal("α", 0, 1, dims=cat_cols)
      
      # Normally I would define something like this:
      code_1 = pm.MutableData("c1", df.c1.cat.codes)
      code_2 = pm.MutableData("c2", df.c2.cat.codes)
      μ = at.exp(α[code_1, code_2])
      
      # But since the categorical columns are not defined
      # ahead of time, I'm trying something like:
      codes = np.c_[[df[c].cat.codes for c in cat_cols]]
      codes_ = pm.MutableData("codes", codes_)
      μ = at.exp(α[codes])
      # doesn't work
      
      # A different approach that doesn't work
      codes = []
      for c in cat_cols:
        codes.append(pm.MutableData(c, df[c].cat.codes))
      μ = at.exp(α[at.stack(codes)])
      # OR
      μ = at.exp(α[at.stack(codes).T])

I just can’t seem to figure out how to do that. When I try the approaches above I invariable get an error about input dimension mismatch. Any help would be greately appreciated!

Thanks,
Omri

It may not be pretty or especially performant, but you could consider something like this:

code_list = [pm.MutableData(c, df[c].cat.codes) for c in cat_cols]

mu = alpha
for codes in code_list:
     mu = mu[codes]
mu = at.exp(mu)

Also, I’m attempting to guess your intentions here:

pm.Normal(“α”, 0, 1, dims=cat_cols)

but are you sure you want a RV with shape (n1, n2,...) as opposed to separate 1D arrays? The latter is more akin to a standard independent loglinear model whereas the former is including interactions between all the levels of c1, c2,… etc.

1 Like

Thanks for the reply! What I’m building is actually a Poisson GLM with multiple intercepts (one for each combination of categorical variables), following Richard McElreath’s recommendation of using indexing instead of one-hot encoding categorical variables.

In the end I managed to solve the issue with the help from people in the Learning Bayes Stats slack channel. The solution was quite elegant in the end:

import operator

# code before defining model context etc...
codes = [pm.MutableData(c, df[c].cat.codes) for c in cat_columns]
α = pm.Normal("α", 0, 1, dims=cat_columns)
μ = at.exp(operator.getitem(α, tuple(codes)) + ...)
# rest of the code goes here...

The code is a little more involved than that, but I just wanted to distill here the aspect I was stuck in. There is also an offset term in the formula that I didn’t include for brevity.

2 Likes

Thanks for sharing that solution here! I like it a lot and will definitely be using that one in the future for indexing in high dims.