What priors to choose for one hot encoded columns

Oliver · April 12, 2022, 4:06pm

Hi, I’m building a logistic regression model and I have a lot of categorical variables that I one-hot encoded.
My model keeps diverging.
What type of prior should I choose to help it along a little?

Thanks a lot for any insights, tips and hints

5hv5hvnk · April 12, 2022, 7:23pm

What type of data is it? Is it closely related?
If not, one option could be to recode the categorical variables by clustering categories that follow a similar pattern into broader classes.
Otherwise, try PCA or SVD for dimensionality reduction.
You can also try L2 regularisation.
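
To tie the L2 suggestion back to the question about priors: an L2 penalty on the coefficients corresponds to independent zero-mean Normal priors on them, with penalty strength lam = 1 / (2 * sigma**2). The sketch below checks this numerically in plain Python. The data, the coefficient vectors, and sigma = 2.0 are made up for illustration and are not a recommended prior scale.

```python
import math

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of a logistic regression (no intercept)."""
    total = 0.0
    for row, target in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        p = 1.0 / (1.0 + math.exp(-eta))
        total += target * math.log(p) + (1 - target) * math.log(1 - p)
    return total

def l2_penalised_loss(beta, X, y, lam):
    """Negative log-likelihood plus an L2 penalty on every coefficient."""
    return -log_likelihood(beta, X, y) + lam * sum(b * b for b in beta)

def neg_log_posterior(beta, X, y, sigma):
    """Negative log-posterior with independent Normal(0, sigma) priors."""
    log_prior = sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - b * b / (2 * sigma ** 2)
        for b in beta
    )
    return -log_likelihood(beta, X, y) - log_prior

# Made-up one-hot rows and binary outcomes, for illustration only.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y = [1, 0, 1, 0]

sigma = 2.0                      # assumed prior scale
lam = 1.0 / (2.0 * sigma ** 2)   # matching L2 penalty strength

beta_a = [0.3, -1.2, 0.8]
beta_b = [-0.5, 0.1, 2.0]

# The two objectives differ only by a constant that does not depend on beta,
# so they share the same minimiser (the MAP estimate).
gap_a = neg_log_posterior(beta_a, X, y, sigma) - l2_penalised_loss(beta_a, X, y, lam)
gap_b = neg_log_posterior(beta_b, X, y, sigma) - l2_penalised_loss(beta_b, X, y, lam)
print(abs(gap_a - gap_b) < 1e-9)
```

So a smaller sigma means stronger shrinkage. Note that this only regularises the coefficients; it does not remove redundant columns from the design matrix.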

chartl · April 12, 2022, 7:32pm

Hi Oliver-

If you’ve one-hot encoded a lot of categorical variables, it is likely that your model suffers from multicollinearity, meaning that one or more coefficients are non-identifiable. See this post for more information.

The standard recommendation would be to identify the source(s) of multicollinearity and either combine or drop these features to ensure model identifiability.
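
A minimal sketch of one common source of this, assuming the model also has an intercept: the indicator columns of a fully one-hot encoded variable sum to the intercept column, so the design matrix loses rank, and dropping one level per variable restores it. The `colour` data and both helper functions are made up for illustration.

```python
from fractions import Fraction

def one_hot(values, drop_first=False):
    """One indicator column per level, optionally skipping the first level."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    return [[1 if v == level else 0 for level in levels] for v in values]

def matrix_rank(rows):
    """Rank via exact Gaussian elimination on Fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

colour = ["red", "green", "blue", "green", "red", "blue"]  # made-up data

# Prepend an intercept column of ones to each encoding.
full = [[1] + row for row in one_hot(colour)]
reduced = [[1] + row for row in one_hot(colour, drop_first=True)]

full_rank = matrix_rank(full)
reduced_rank = matrix_rank(reduced)
print(len(full[0]), full_rank)        # columns vs rank, all levels kept
print(len(reduced[0]), reduced_rank)  # columns vs rank, one level dropped
```

When the rank is lower than the number of columns, at least one coefficient cannot be identified from the data alone. The dropped level becomes the reference category that the remaining coefficients are measured against.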

Oliver · April 13, 2022, 7:15am

Definitely, thanks a lot. I dislike one-hot encoding for exactly this reason: a 1 in category X implies a 0 in all the others, which is so redundant. But I’m still learning, and it was the only way I could think of to feed the data into the model.
I will try PCA, as mentioned in your other post and above by @5hv5hvnk.
Thanks to you both!
