How to name categorical variables?


#1

I previously ran a GLM model. One of the variables was categorical - “Day of Week” (dow).
The resultant varnames were very easy to read:
‘dow[T.Monday]’,
‘dow[T.Saturday]’,
‘dow[T.Sunday]’,
‘dow[T.Thursday]’,
‘dow[T.Tuesday]’,
‘dow[T.Wednesday]’

I subsequently created a full-fledged model, still using dow as a variable. However, in this instance, the variables are named dow[0], dow[1], … dow[6].

Is there a way to name these variables the same way as in the GLM case? Those names are very meaningful and easy to read.

I’m guessing it’s one of the additional parameters (like shape), but I could not find any documentation on how to do this.

Thanks for your help.


#2

I don’t think it is possible. In GLM, the module adds a random variable for each column of the design matrix, named after that column’s label; that’s where the meaningful labels come from. When you model it in a regular pymc3 model block, the RV has only one name, and if it is not a scalar, pymc3 will display it element-wise as RV_name_0 etc.
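One possible workaround is to keep a lookup from element index to category label and apply it when reading the results. The sketch below assumes dow is a string column in a pandas DataFrame; the `data` frame, its values, and the `dow[...]` label format are made up for illustration and are not part of pymc3.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset.
data = pd.DataFrame({"dow": ["Monday", "Friday", "Sunday", "Monday", "Tuesday"]})

# Convert to a categorical so each day gets a stable integer code
# (categories are sorted alphabetically for string data).
dow_cat = pd.Categorical(data["dow"])
dow_idx = dow_cat.codes              # one integer per row, usable to index a coefficient vector
dow_levels = list(dow_cat.categories)

# Map position i in the coefficient vector to a readable label.
labels = {i: f"dow[{level}]" for i, level in enumerate(dow_levels)}
print(labels)
```

With this mapping, element i of the vector-valued RV can be relabelled by hand after sampling, for example when renaming rows of a summary table.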


#3

Ah, that’s a shame. Can I suggest that as a future feature?


#4

A follow-on question:

In GLM, the dummy variable encoding automatically drops one value. In other words, if there are n options, it creates n-1 dummy variables.

What’s the best way to achieve this in the general model?
I have a variable defined as:

b = pm.Normal('b', mu=0., sd=0.5, shape=no_dow)
dow stands for “Day of week”.

Is it possible to define this so that:
b[0] = 0
b[1:] = pm.Normal('b', mu=0., sd=0.5, shape=no_dow-1)

Or should I have used something different, like pm.Categorical? (I’m not familiar with that distribution.)
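The idea in the question, pinning the first coefficient at zero and estimating only the remaining ones, amounts to prepending a constant zero to a vector of length no_dow - 1. The following is a NumPy sketch of that arithmetic only, with made-up coefficient values; inside a PyMC3 model the free part would be the pm.Normal variable and the concatenation would need the tensor equivalent, which is not shown here.

```python
import numpy as np

no_dow = 7

# Made-up values standing in for the no_dow - 1 free coefficients.
b_free = np.array([0.3, -0.1, 0.2, 0.0, 0.5, -0.4])

# Prepend a fixed zero for the reference category, giving no_dow entries.
b_full = np.concatenate([np.zeros(1), b_free])

# Integer-coded day of week for a few observations (0 = reference day).
dow_idx = np.array([0, 3, 6, 1])

# Per-observation day-of-week effect; the reference day contributes 0.
effect = b_full[dow_idx]
print(effect)
```

Rows belonging to the reference day get an effect of exactly zero, which is the same thing n-1 dummy variables express.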


#5

We rely on Patsy to parse the linear equation. You can have a look at their manual for different options: http://patsy.readthedocs.io/en/latest/categorical-coding.html
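For reference, this is roughly what Patsy's default treatment coding produces when called directly. It is a minimal sketch on a made-up DataFrame, not the code path pymc3 itself runs; the column name dow matches the thread, and everything else is illustrative.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical stand-in for the real dataset.
data = pd.DataFrame({"dow": ["Monday", "Tuesday", "Friday", "Sunday", "Monday"]})

# Default (treatment) coding: an intercept plus one column per
# non-reference level. The first level in sorted order is the reference.
design = dmatrix("dow", data, return_type="dataframe")

column_names = list(design.columns)
print(column_names)
```

The reference level (Friday here, as the first day alphabetically) gets no column of its own, which is why the GLM output in the first post lists only six days.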


#6

Thanks @junpenglao
The patsy library seems very flexible and should be able to do a lot of things. I will explore how to use it to improve my code.

In the meantime, this is how I solved the categorical variable problem:

‘dow’ is the categorical variable “Day of Week”, which naturally has 7 values.

Step 1: dummy encode
data = pd.get_dummies(data, prefix='dow', columns=['dow'], drop_first=True)
Make sure drop_first=True is set, to remove the redundant column.

Step 2: set up data
dows = [col for col in data if str(col).startswith('dow_')]
dow_cols = data[dows]
no_dow = len(dows)

Step 3: set up model
b = pm.Normal('b', mu=0., sd=0.5, shape=no_dow)  # Note: no_dow is one fewer than the actual number of categories

k = pm.math.matrix_dot(dow_cols,b) 

This uses Theano’s matrix dot product (via pm.math.matrix_dot) to compute the sum of products, which handles a matrix of any shape without manually typing out each category.

I think the Patsy library can be employed to automate some of these steps, but this will work for now.
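The data-preparation steps above can be checked outside the model. The sketch below runs them on a made-up DataFrame and uses a plain NumPy dot product with invented coefficient values as a stand-in for pm.math.matrix_dot, so it shows the arithmetic only, not the sampling. Casting the dummy columns to float is a precaution, since some pandas versions return boolean dummies.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset.
data = pd.DataFrame({"dow": ["Monday", "Friday", "Sunday", "Monday", "Tuesday"]})

# Step 1: dummy encode, dropping the first level as the reference.
data = pd.get_dummies(data, prefix="dow", columns=["dow"], drop_first=True)

# Step 2: collect the dummy columns as a float matrix.
dows = [col for col in data if str(col).startswith("dow_")]
dow_cols = data[dows].to_numpy(dtype=float)
no_dow = len(dows)

# Step 3 (NumPy stand-in): made-up coefficients, one per remaining category.
b = np.array([0.3, -0.1, 0.2])
k = dow_cols @ b

# The dummy column names double as readable labels for the coefficients.
coef_names = dict(zip(dows, b))
print(coef_names)
print(k)
```

Keeping the list dows around also partly addresses the original naming question, since element i of b corresponds to dows[i].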