Hi,
I am trying to do a regression analysis on my data. The data set has multiple observations, each of which is a time series (y~t) that can is modeled theoretically by y = A*(1+B*t)**B. I have tried the following model in pymc3 which works fine,
mu_A ~ Normal(…)
sig_A ~ HalfCauchy(…)
mu_B ~ Normal(…)
sig_B ~ HalfCauchy(…)
A ~ Normal(mu_A, sig_A, n_obs)
B ~ Normal(mu_B, sig_B, n_obs)
sig ~ HalfCauchy(…)
y ~ Normal( A[obs]*(1+B[obs]*t)**B[obs], sig, observed=…)
My question is: the observations have categorical parameters cat_1 and cat_2, for cat_1, each observation can only belong to one category of cat_1 (either cat_1_1 or cat_1_2); for cat_2, each observation can belong to one or more categories (observation #1 belong to cat_2_1, observation #2 belong to cat_2_1 and cat_2_3, observation #3 belong to cat_2_1 and cat_2_2 and cat_2_3). How can I add those categorical parameters in my model? Any suggestion will be greatly appreciated. Thanks!
1 Like
cat_1 should be relatively straight forward, as you can have a categorical predictor of 0 and 1. For cat_2 you do it similarly, but code each category as a feature and have a similar 0 and 1 predictor. Something like:
cat1 = [0, 1, 1, 0, ...]
cat2_1 = [0, 0, 1, 0, 1, 0, ...]
cat2_2 = [1, 0, 1, 1, 0, 0, ...]
...
with the linear prediction:
cat1beta = Normal(..., shape=2)
cat2_1beta = Normal(...)
cat2_2beta = Normal(...)
cat2_3beta = Normal(...)
y = A*(1+B*t)**B + cat1beta[cat1] + cat2_1beta*cat2_1 + ...
I have recently encountered this problem. Here’s how I solved it:
step 1 - use pandas to dummy code your categorical variable:
A = pd.get_dummies(A,prefix='cat1_',columns=['cat_1'],drop_first=True)
make sure you set drop_first = True, as this removes the redundant column.
After this step, you should see a few new columns in your dataframe of [0,1] that correspond to cat_1
Step 2 - select cat_1 matrix:
cat1_col_list = [col for col in data if col.startswith(‘cat1_’)]
cat1_cols = data[cat1_col_list ]
no_cat1 = len(cat1_col_list )
Step 3 - sample with the categorical variable
I would modify @junpenglao 's suggestions slightly to
cat1beta = Normal(…, shape = no_cat1)
y = A*(1+Bt)**B + math.matrix_dot(cat1_cols,cat1beta )
This way, you should be able to handle categorical variables with any number of categories without having to manually write up each category.
@junpenglao can you check if this makes sense? particularly with the use of theano matrix_dot operations.
2 Likes
Yep, the two should be equivalent, but using matrix multiplication would be more flexible and easier to extend in most case.