Hi all, and apologies if it’s a simple question. I’ve tried googling and for some reason still can’t wrap my head around it…
I’m trying to create a function that generated models which could have differing number of categorical features.
The way the model is defined (abbreviated) is the following:
def build_model(df, cat_cols=['c1', ...]):
coords = {c: df[c].cat.categories, for c in cat_cols}
with pm.Model(coords=coords) as m1:
α = pm.Normal("α", 0, 1, dims=cat_cols)
# Normally I would define something like this:
code_1 = pm.MutableData("c1", df.c1.cat.codes)
code_2 = pm.MutableData("c2", df.c2.cat.codes)
μ = at.exp(α[code_1, code_2])
# But since the categorical columns are not defined
# ahead of time, I'm trying something like:
codes = np.c_[[df[c].cat.codes for c in cat_cols]]
codes_ = pm.MutableData("codes", codes_)
μ = at.exp(α[codes])
# doesn't work
# A different approach that doesn't work
codes = []
for c in cat_cols:
codes.append(pm.MutableData(c, df[c].cat.codes))
μ = at.exp(α[at.stack(codes)])
# OR
μ = at.exp(α[at.stack(codes).T])
I just can’t seem to figure out how to do that. When I try the approaches above I invariable get an error about input dimension mismatch. Any help would be greately appreciated!
It may not be pretty or especially performant, but you could consider something like this:
code_list = [pm.MutableData(c, df[c].cat.codes) for c in cat_cols]
mu = alpha
for codes in code_list:
mu = mu[codes]
mu = at.exp(mu)
Also, I’m attempting to guess your intentions here:
pm.Normal(“α”, 0, 1, dims=cat_cols)
but are you sure you want a RV with shape (n1, n2,...) as opposed to separate 1D arrays? The latter is more akin to a standard independent loglinear model whereas the former is including interactions between all the levels of c1, c2,… etc.
Thanks for the reply! What I’m building is actually a Poisson GLM with multiple intercepts (one for each combination of categorical variables), following Richard McElreath’s recommendation of using indexing instead of one-hot encoding categorical variables.
In the end I managed to solve the issue with the help from people in the Learning Bayes Stats slack channel. The solution was quite elegant in the end:
import operator
# code before defining model context etc...
codes = [pm.MutableData(c, df[c].cat.codes) for c in cat_columns]
α = pm.Normal("α", 0, 1, dims=cat_columns)
μ = at.exp(operator.getitem(α, tuple(codes)) + ...)
# rest of the code goes here...
The code is a little more involved than that, but I just wanted to distill here the aspect I was stuck in. There is also an offset term in the formula that I didn’t include for brevity.