How to name categorical variables?

benyi-mikara · July 11, 2018, 5:15am

I previously ran a GLM model. One of the variables was categorical - “Day of Week” (dow).
The resultant varnames were very easy to read:
‘dow[T.Monday]’,
‘dow[T.Saturday]’,
‘dow[T.Sunday]’,
‘dow[T.Thursday]’,
‘dow[T.Tuesday]’,
‘dow[T.Wednesday]’

I subsequently created a full-fledged model, still using dow as a variable. However, in this model the variables are named dow[0], dow[1], … dow[6].

Is there a way to name these variables the same way as in the GLM case? Those names are meaningful and easy to read.

I’m guessing it’s one of the additional parameters (like shape), but I could not find any documentation on how to do this.

Thanks for your help.

junpenglao · July 11, 2018, 5:40am

I don’t think it is possible. In GLM, the module adds each column of the design matrix as its own random variable, labelled with that column’s name; that’s where the meaningful labels come from. When you write it in a regular pymc3 model block, the RV has only one name, and if it is not a scalar pymc3 will display it element-wise as RV_name_0, etc.
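
As a rough illustration of where the GLM labels come from, here is a minimal pure-Python sketch that rebuilds them by hand. It assumes the levels are sorted and the first one (Friday) serves as the reference level, which is consistent with Friday being absent from the names in the question. The `index_to_level` mapping is a hypothetical post-hoc relabelling aid, not a pymc3 feature, and it is only right if the integer codes 0..6 were assigned in sorted order.

```python
# Levels of the categorical variable (hypothetical input for illustration).
levels = ["Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday"]

# Assumption: levels are sorted and the first one is the reference level,
# so it gets no column (and no label) of its own.
ordered = sorted(levels)
reference = ordered[0]

# Labels in the same style as the GLM output shown in the question.
glm_labels = ["dow[T.{}]".format(level) for level in ordered[1:]]

# Hypothetical lookup from index-based names to readable level names.
# Only valid if integer codes were assigned in sorted order.
index_to_level = {"dow[{}]".format(i): level for i, level in enumerate(ordered)}

print(reference)
print(glm_labels)
```

Renaming the rows of a summary table with such a lookup is a workaround applied after sampling; the variable names inside the model itself stay index-based.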

benyi-mikara · July 11, 2018, 5:42am

Ah, that’s a shame. Can I suggest that as a future feature?

benyi-mikara · July 11, 2018, 7:54am

A follow-on question:

In GLM, the dummy variable encoding automatically drops one value. In other words, if there are n options, it creates n-1 dummy variables.

What’s the best way to achieve this in the general model?
I have a variable defined as:

b = pm.Normal('b', mu=0., sd=0.5, shape=no_dow)
dow stands for “Day of week”.

Is it possible to define this so that:
b[0] = 0
b[1:] = pm.Normal('b', mu=0., sd=0.5, shape=no_dow-1)

Or should I have used something different, like pm.Categorical? (I’m not familiar with that distribution.)
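
One way to picture the "b[0] = 0" idea is to keep only no_dow - 1 free coefficients and prepend a fixed zero for the reference day. The NumPy sketch below shows just that arithmetic; the values in `b_free` and `dow_idx` are made up for illustration. Inside a pymc3 model the same concatenation would have to be done on tensors (for example with theano.tensor.concatenate), which is an assumption here and not something confirmed in this thread.

```python
import numpy as np

no_dow = 7

# Hypothetical values standing in for the no_dow - 1 free coefficients
# (in a model these would come from the Normal prior).
b_free = np.array([0.1, -0.3, 0.2, 0.0, 0.4, -0.1])

# Prepend a fixed zero so the reference level has no effect of its own.
b = np.concatenate([[0.0], b_free])

# Integer day codes, one per observation; code 0 is the reference level.
dow_idx = np.array([0, 1, 6, 3])

# Look up each observation's day-of-week effect.
effect = b[dow_idx]
print(effect)
```

Only `b_free` would carry a prior in this setup, so the model has n-1 parameters for n categories, matching what the GLM dummy encoding does.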

junpenglao · July 11, 2018, 9:15am

We rely on Patsy to parse the linear equation. You can have a look at their manual for different options: http://patsy.readthedocs.io/en/latest/categorical-coding.html

benyi-mikara · July 12, 2018, 8:09am

Thanks @junpenglao
The Patsy library seems very flexible and should be able to do a lot. I will explore how to use it to improve my code.

In the meantime, this is how I solved the categorical variables problem:

‘dow’ is the categorical variable “Day of Week”, which naturally has 7 values.

Step 1: dummy encode
data = pd.get_dummies(data,prefix='dow',columns=['dow'],drop_first=True)
Make sure drop_first=True is set so that the redundant column is removed.

Step 2: set up data
dows = [col for col in data if str(col).startswith('dow_')]
dow_cols = data[dows]
no_dow = len(dows)

Step 3: set up model
b = pm.Normal('b', mu=0., sd=0.5, shape=no_dow)  # Note: no_dow is one fewer than the actual number of categories

k = pm.math.matrix_dot(dow_cols,b)

This uses Theano’s matrix dot operation to compute the sum of products, which handles a matrix of any shape without typing out each category by hand.

I think the Patsy library could be used to automate some of these steps, but this will work for now.
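
The three steps above can be checked outside a model with a small pandas/NumPy sketch. The toy data frame and the fixed coefficient vector `b` are hypothetical stand-ins (in the real model `b` is the pm.Normal variable), and `dtype=float` is added on the assumption that a recent pandas version would otherwise return boolean dummy columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per observation, only three distinct days.
data = pd.DataFrame({"dow": ["Monday", "Friday", "Sunday", "Monday"]})

# Step 1: dummy-encode; drop_first=True drops the first level (Friday here).
data = pd.get_dummies(data, prefix="dow", columns=["dow"],
                      drop_first=True, dtype=float)

# Step 2: collect the dummy columns.
dows = [col for col in data if str(col).startswith("dow_")]
dow_cols = data[dows].to_numpy()
no_dow = len(dows)  # one fewer than the number of categories present

# Step 3 (stand-in): fixed coefficients instead of a prior, giving the
# effects of Monday and Sunday relative to Friday.
b = np.array([0.3, -0.2])

# Matrix product: each row picks out the coefficient of its own day,
# and Friday rows (all zeros) contribute nothing.
k = dow_cols @ b
print(dows)
print(k)
```

The column order in `dows` is also the order of the entries of `b`, so keeping that list around gives a readable label for each coefficient.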
