Partial pooled model with complex categories


#1

Hi,

I am trying to do a regression analysis on my data. The data set has multiple observations, each of which is a time series (y~t) that can is modeled theoretically by y = A*(1+B*t)**B. I have tried the following model in pymc3 which works fine,

mu_A ~ Normal(…)
sig_A ~ HalfCauchy(…)
mu_B ~ Normal(…)
sig_B ~ HalfCauchy(…)

A ~ Normal(mu_A, sig_A, n_obs)
B ~ Normal(mu_B, sig_B, n_obs)

sig ~ HalfCauchy(…)
y ~ Normal( A[obs]*(1+B[obs]*t)**B[obs], sig, observed=…)

My question is: the observations have categorical parameters cat_1 and cat_2, for cat_1, each observation can only belong to one category of cat_1 (either cat_1_1 or cat_1_2); for cat_2, each observation can belong to one or more categories (observation #1 belong to cat_2_1, observation #2 belong to cat_2_1 and cat_2_3, observation #3 belong to cat_2_1 and cat_2_2 and cat_2_3). How can I add those categorical parameters in my model? Any suggestion will be greatly appreciated. Thanks!


#2

cat_1 should be relatively straight forward, as you can have a categorical predictor of 0 and 1. For cat_2 you do it similarly, but code each category as a feature and have a similar 0 and 1 predictor. Something like:

cat1 = [0, 1, 1, 0, ...]
cat2_1 =  [0, 0, 1, 0, 1, 0, ...]
cat2_2 =  [1, 0, 1, 1, 0, 0, ...]
...

with the linear prediction:

cat1beta = Normal(..., shape=2)
cat2_1beta = Normal(...)
cat2_2beta = Normal(...)
cat2_3beta = Normal(...)
y = A*(1+B*t)**B + cat1beta[cat1] + cat2_1beta*cat2_1 + ...

#3

I have recently encountered this problem. Here’s how I solved it:

step 1 - use pandas to dummy code your categorical variable:
A = pd.get_dummies(A,prefix='cat1_',columns=['cat_1'],drop_first=True)
make sure you set drop_first = True, as this removes the redundant column.
After this step, you should see a few new columns in your dataframe of [0,1] that correspond to cat_1

Step 2 - select cat_1 matrix:
cat1_col_list = [col for col in data if col.startswith(‘cat1_’)]
cat1_cols = data[cat1_col_list ]
no_cat1 = len(cat1_col_list )

Step 3 - sample with the categorical variable
I would modify @junpenglao 's suggestions slightly to
cat1beta = Normal(…, shape = no_cat1)
y = A*(1+Bt)**B + math.matrix_dot(cat1_cols,cat1beta )

This way, you should be able to handle categorical variables with any number of categories without having to manually write up each category.

@junpenglao can you check if this makes sense? particularly with the use of theano matrix_dot operations.


#4

Yep, the two should be equivalent, but using matrix multiplication would be more flexible and easier to extend in most case.