Regression model using only categorical variables

dhdcjswo · February 25, 2020, 1:29pm

Hi guys.

I am looking for a way to build a regression model using only categorical variables.
My data consists of a continuous dependent variable and 2 categorical independent variables (one has 20 categories and the other has 1000 categories).
I found several ways to deal with categorical variables, but they are all about binary variables.

My model would be a hierarchical model like
y = a[x2_idx] + b[x2_idx] * x1

If any of you know a solution, please let me know.
It would be a big help and really appreciated! Thank you!

sammosummo · February 26, 2020, 1:56pm

You need to recode your independent variables.

Your question seems to be less about Bayesian modeling or PyMC3 and more of a general one about regression/data analysis. I would recommend looking up regression with categorical variables in a basic stats textbook (or even reading the Wikipedia article).

cadama · July 28, 2020, 3:35pm

Let me rephrase the question. What is an efficient way of treating categorical variables in pymc3? I am a newbie here and I see a lot of pm.math.dot in the linear model examples. Is there a way to do this dot product without first binarising (pd.get_dummies) your categorical variables? The use case would be models that make use of categorical features with lots of levels.
Thanks.

AlexAndorra · July 29, 2020, 9:52am

You can use index variables for variables with many categories: instead of having one indicator variable per category, you have a single index variable in which each category is an integer level. See chapter 5, section 5.3 of Rethinking2 for instance (PyMC port, see code 5.50).
Pandas’ factorize method is very useful for this:

district_id, districts = data.arrondissement.factorize(sort=True)
Ndistricts = len(districts)
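
To make the index-variable idea concrete, here is a minimal NumPy sketch (not taken from the linked notebook). The `district` column and the per-category coefficients `a` are made up for illustration, and `np.unique(..., return_inverse=True)` stands in for `factorize(sort=True)`. It shows that indexing a coefficient vector by the integer codes gives the same values as a dot product with the dummy matrix, so the dummies never have to be built.

```python
import numpy as np

# Hypothetical categorical column (made-up data for illustration).
district = np.array(["north", "south", "east", "south", "north", "west"])

# Sorted unique labels plus one integer code per row. This mirrors what
# pandas' factorize(sort=True) returns, with the two outputs swapped.
districts, district_id = np.unique(district, return_inverse=True)
n_districts = len(districts)

# One made-up coefficient per category (in a model this would be a
# vector-valued parameter with one entry per level).
a = np.array([0.5, -1.0, 2.0, 0.1])

# Index-variable approach: look up each row's coefficient directly.
mu_indexed = a[district_id]

# Dummy-variable approach: build the one-hot matrix and take a dot product.
one_hot = np.eye(n_districts)[district_id]
mu_dot = one_hot @ a

print(districts)
print(district_id)
print(np.allclose(mu_indexed, mu_dot))
```

With 1000 levels the one-hot matrix has 1000 columns, while the index version only stores one integer per row, which is why the index form tends to be the more practical choice for high-cardinality variables.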

Hope this helps

cadama · July 30, 2020, 8:23am

I don’t see the factorize method being used in that part (nor in the notebook), wrong link?

Nevertheless, how would you perform the dot product in pymc3 / theano to obtain a linear projection for the location parameter of a pymc distribution?

In one of my attempts I used:

import pandas as pd
import pymc3 as pm
import theano.tensor as tt

# m, beta, clicks and conversions are not defined in this snippet
X = pd.get_dummies(X, columns=['cate1'], sparse=True)
mu0 = pm.Normal('mu0', 0, 1)
mu1 = pm.Normal('mu1', 0, 10, shape=m)
alpha = pm.Deterministic('alpha', tt.exp(mu0 - pm.math.dot(X, mu1)))
y = pm.BetaBinomial('obs', n=clicks, alpha=alpha, beta=beta, observed=conversions)

No exception was raised, but the code hangs forever.

AlexAndorra · July 30, 2020, 3:55pm

Yeah, that looks good. If you don’t encounter any shape error, that’s already a very good sign. You can also check the shape of any tensor with print(pm.math.dot(X, mu1).tag.test_value.shape), which is usually helpful!

But if it’s your first PyMC model, you can also just write down the complete formula b0 + b1 * X1 + b2 * X2... instead of the dot product, to get an intuition of how things work.
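
As a quick check of that intuition, here is a small NumPy sketch (plain arrays, not PyMC3 tensors) comparing the written-out formula with the dot product. The design matrix and the coefficient values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up design matrix: 5 observations, 2 predictors.
X = rng.normal(size=(5, 2))
X1, X2 = X[:, 0], X[:, 1]

# Made-up intercept and slopes.
b0 = 1.0
b = np.array([0.5, -2.0])

# The complete formula, written out term by term.
mu_explicit = b0 + b[0] * X1 + b[1] * X2

# The same linear predictor as a single dot product.
mu_dot = b0 + X @ b

print(mu_explicit.shape, mu_dot.shape)
print(np.allclose(mu_explicit, mu_dot))
```

Both versions give one value per observation. The dot product is just a compact way of writing the same sum once the number of columns grows.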

I don’t think the notebook I linked to uses factorize, but it does show how to use categorical variables in regressions. My code snippet does use factorize though, in data.arrondissement.factorize(sort=True), and I linked to the pandas doc.

Hope this helps

cadama · August 3, 2020, 3:08pm

Thanks. I know how to encode categorical variables in Python (pandas, scikit-learn, etc.). Nevertheless, I don’t know what the best way to do it is in this context (pymc3, theano, and Bayesian models in general).

Is pymc3 integrated with pandas SparseArray or scipy sparse matrices?
Should one assign a distribution to each categorical level in a linear (Bayesian) regression?
How much is too much for pymc?

The use of predictors in general, and categorical features in particular, seems to be quite limited in the Bayesian context compared to ML in general.

AlexAndorra · August 4, 2020, 9:27am

What did you find confusing in the resources I linked to?
As each category gets its own parameter, each needs its own prior, indeed (but this is the same for indicator variables). You can see that in the use of the shape parameter.
You will also find a lot of examples of models with categorical variables in chapter 13.
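
As a rough illustration of "each category gets its own parameter", here is a NumPy simulation (not PyMC3 code) that draws one intercept and one slope per level from a normal prior, then builds the mean of the first post's formula `y = a[x2_idx] + b[x2_idx] * x1` with an index variable. The number of observations, the prior scales, the noise level, and treating `x1` as a numeric column are all assumptions made for the example. In PyMC3 the analogous step would be a prior declared with `shape=n_levels` and indexed the same way.

```python
import numpy as np

rng = np.random.default_rng(42)

n_levels = 1000   # number of categories in x2 (from the first post)
n_obs = 5000      # made-up number of observations

# Made-up data: a category code per row and a numeric predictor.
x2_idx = rng.integers(0, n_levels, size=n_obs)
x1 = rng.normal(size=n_obs)

# One draw per level, i.e. a prior with "shape = n_levels".
a = rng.normal(loc=0.0, scale=1.0, size=n_levels)   # intercepts
b = rng.normal(loc=0.0, scale=0.5, size=n_levels)   # slopes

# Each row picks up the intercept and slope of its own category.
mu = a[x2_idx] + b[x2_idx] * x1

# Simulated outcome with made-up observation noise.
y = rng.normal(loc=mu, scale=0.3)

print(a.shape, b.shape, mu.shape, y.shape)
```

The parameter vectors have length 1000 regardless of how many rows the data has, and no 1000-column dummy matrix is ever created.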

josemrodriguezf · October 15, 2021, 11:08pm

I have a question: what would be the best way to include a categorical variable together with a lag (the previous year’s value) of the same categorical variable? I deal with climate data, and including previous-year information is important.

For example, with x = type of year (water scarcity / normal year) and t = time:

y = b0[x_t, x_(t-1)] + b[x_t, x_(t-1)] * X1
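
One possible way to set up the indexing that the formula above implies, sketched in NumPy with made-up yearly states and made-up coefficient values (this only covers the bookkeeping, not the time series modelling itself): pair each year's state with the previous year's, then either index a 2-D coefficient table or collapse the pair into a single index variable.

```python
import numpy as np

# Made-up yearly states: 0 = normal year, 1 = water-scarcity year.
state = np.array([0, 1, 1, 0, 0, 1])
n_states = 2

# Current and previous year's state; the first year has no lag and is dropped.
x_t = state[1:]
x_prev = state[:-1]

# Option 1: index a 2-D coefficient table b0[x_t, x_prev].
b0 = np.array([[0.0, 0.3],
               [1.0, 1.8]])   # made-up values
intercept_2d = b0[x_t, x_prev]

# Option 2: collapse the pair into a single index with n_states**2 levels.
pair_idx = x_t * n_states + x_prev
intercept_flat = b0.ravel()[pair_idx]

print(pair_idx)
print(np.allclose(intercept_2d, intercept_flat))
```

The two options select the same coefficients. The single combined index has the advantage that it can be used exactly like any other index variable.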

drbenvincent · October 26, 2021, 9:10am

Hi @josemrodriguezf
I think this is its own question that needs its own thread in order to attract folk who know about time series modelling. Feel free to start a new thread.
