What priors to choose for one hot encoded columns

Hi, I’m making a logistic regression model and I have a lot of categorical variables that I one hot encoded.
My model keeps diverging.
What type of prior should I choose to help it along a little?

Thanks a lot for any insights, tips, and hints!

What type of data is it? Are the categories closely related?
If not, one option is to recode the categorical variables by clustering categories that share a pattern into broader classes.
Otherwise, try PCA or SVD for dimensionality reduction.
You can also try L2 regularisation.
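
Since the question was about priors, it may help to note that an L2 penalty on the coefficients corresponds to a MAP estimate under independent zero-mean Normal priors, with prior standard deviation `1 / sqrt(lam)`. Below is a minimal NumPy sketch of that idea, not a recommendation for a specific library. The toy data, the function name `fit_logistic_l2`, and the values of `lam` are all made up for illustration, and the intercept is penalised too, purely to keep the code short.

```python
import numpy as np

def fit_logistic_l2(X, y, lam=1.0, lr=0.5, n_steps=2000):
    """Gradient descent on mean log-loss + lam / (2 n) * ||w||^2.

    Equivalent (up to the factor 1/n) to MAP estimation with
    independent Normal(0, 1 / sqrt(lam)) priors on every coefficient.
    """
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_steps):
        prob = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted probabilities
        grad = X.T @ (prob - y) / n + lam * w / n  # data term + penalty term
        w -= lr * grad
    return w

# Hypothetical toy data: one categorical variable with three levels.
n = 300
codes = np.arange(n) % 3
# Intercept plus ALL three indicator columns, so the design is redundant.
X = np.hstack([np.ones((n, 1)), np.eye(3)[codes]])

rng = np.random.default_rng(0)
true_logit = np.array([-1.0, 0.0, 1.5])[codes]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

w_weak = fit_logistic_l2(X, y, lam=0.01)   # wide prior, little shrinkage
w_strong = fit_logistic_l2(X, y, lam=10.0) # narrow prior, more shrinkage

print("weak penalty:  ", np.round(w_weak, 3))
print("strong penalty:", np.round(w_strong, 3))
```

The stronger penalty pulls the coefficients towards zero, which is the same effect a tighter Normal prior would have in a Bayesian model. It does not remove the redundancy in the design, it only makes the fit better behaved.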


Hi Oliver-

If you’ve one-hot encoded a lot of categorical variables, it is likely that your model suffers from multicollinearity, meaning that one or more coefficients are non-identifiable. See this post for more information.

The standard recommendation would be to identify the source(s) of multicollinearity and either combine or drop these features to ensure model identifiability.
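
To make the identifiability point concrete, here is a minimal NumPy sketch (not taken from the linked post). It assumes the model has an intercept and uses two made-up categorical variables, `colour` and `shape`. With every level one-hot encoded, the indicator columns of each variable sum to the intercept column, so the design matrix is rank deficient. Dropping one reference level per variable is one common way to restore full column rank.

```python
import numpy as np

def one_hot(codes, n_levels, drop_first=False):
    """Return a 0/1 indicator matrix for integer category codes."""
    encoded = np.eye(n_levels)[codes]
    # Optionally drop the first level so it acts as the reference category.
    return encoded[:, 1:] if drop_first else encoded

n = 120
colour = np.arange(n) % 3   # hypothetical variable with 3 levels
shape = np.arange(n) % 4    # hypothetical variable with 4 levels
intercept = np.ones((n, 1))

# Full one-hot encoding: 1 + 3 + 4 = 8 columns.
X_full = np.hstack([intercept, one_hot(colour, 3), one_hot(shape, 4)])
# Reference-level encoding: 1 + 2 + 3 = 6 columns.
X_ref = np.hstack([intercept,
                   one_hot(colour, 3, drop_first=True),
                   one_hot(shape, 4, drop_first=True)])

rank_full = np.linalg.matrix_rank(X_full)
rank_ref = np.linalg.matrix_rank(X_ref)

print(X_full.shape[1], rank_full)  # 8 columns, rank 6: two redundant directions
print(X_ref.shape[1], rank_ref)    # 6 columns, rank 6: identifiable
```

If the model has no intercept, keeping all levels of one variable (and dropping a level from each of the others) gives the same rank.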


Definitely, thanks a lot. I dislike one-hot encoding for exactly this reason: a 1 in category X implies a 0 in all the others, which is so redundant. But I’m still learning, and it was the only way I could think of to feed the data into the model.
I will try PCA, as mentioned in your other post and above by @5hv5hvnk.
Thanks to you both!
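
For anyone who wants to try the PCA route, here is a rough sketch of what it could look like using a plain SVD in NumPy. The two categorical variables are made up, and the tolerance used to decide which singular values count as zero is an assumption. The resulting component scores are no longer redundant, but they are harder to interpret than per-category coefficients.

```python
import numpy as np

n = 120
colour = np.arange(n) % 3   # hypothetical variable with 3 levels
shape = np.arange(n) % 4    # hypothetical variable with 4 levels

# Full one-hot encoding of both variables: 3 + 4 = 7 columns.
X = np.hstack([np.eye(3)[colour], np.eye(4)[shape]])

# PCA via SVD of the column-centred matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep only directions whose singular value is numerically non-zero.
tol = s.max() * max(Xc.shape) * np.finfo(float).eps
k = int((s > tol).sum())

# Project onto the retained components; these k columns carry all the
# information in the 7 one-hot columns without the redundancy.
scores = Xc @ Vt[:k].T

print(X.shape[1], "one-hot columns ->", k, "components")
```

The `scores` matrix can then be used as the predictors in the logistic regression in place of the raw one-hot columns.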