Help figuring out simple categorical indexing

Hi all, and apologies if it’s a simple question. I’ve tried googling and for some reason still can’t wrap my head around it…

I’m trying to create a function that generated models which could have differing number of categorical features.

The way the model is defined (abbreviated) is the following:

def build_model(df, cat_cols=['c1', ...]):
  coords = {c: df[c].cat.categories, for c in cat_cols}
  with pm.Model(coords=coords) as m1:
      α = pm.Normal("α", 0, 1, dims=cat_cols)
      # Normally I would define something like this:
      code_1 = pm.MutableData("c1",
      code_2 = pm.MutableData("c2",
      μ = at.exp(α[code_1, code_2])
      # But since the categorical columns are not defined
      # ahead of time, I'm trying something like:
      codes = np.c_[[df[c] for c in cat_cols]]
      codes_ = pm.MutableData("codes", codes_)
      μ = at.exp(α[codes])
      # doesn't work
      # A different approach that doesn't work
      codes = []
      for c in cat_cols:
        codes.append(pm.MutableData(c, df[c]
      μ = at.exp(α[at.stack(codes)])
      # OR
      μ = at.exp(α[at.stack(codes).T])

I just can’t seem to figure out how to do that. When I try the approaches above I invariable get an error about input dimension mismatch. Any help would be greately appreciated!


It may not be pretty or especially performant, but you could consider something like this:

code_list = [pm.MutableData(c, df[c] for c in cat_cols]

mu = alpha
for codes in code_list:
     mu = mu[codes]
mu = at.exp(mu)

Also, I’m attempting to guess your intentions here:

pm.Normal(“α”, 0, 1, dims=cat_cols)

but are you sure you want a RV with shape (n1, n2,...) as opposed to separate 1D arrays? The latter is more akin to a standard independent loglinear model whereas the former is including interactions between all the levels of c1, c2,… etc.

1 Like

Thanks for the reply! What I’m building is actually a Poisson GLM with multiple intercepts (one for each combination of categorical variables), following Richard McElreath’s recommendation of using indexing instead of one-hot encoding categorical variables.

In the end I managed to solve the issue with the help from people in the Learning Bayes Stats slack channel. The solution was quite elegant in the end:

import operator

# code before defining model context etc...
codes = [pm.MutableData(c, df[c] for c in cat_columns]
α = pm.Normal("α", 0, 1, dims=cat_columns)
μ = at.exp(operator.getitem(α, tuple(codes)) + ...)
# rest of the code goes here...

The code is a little more involved than that, but I just wanted to distill here the aspect I was stuck in. There is also an offset term in the formula that I didn’t include for brevity.


Thanks for sharing that solution here! I like it a lot and will definitely be using that one in the future for indexing in high dims.