Poor results when using all columns dummy coded categorical features vs using k-1 columns

dorian821 · May 12, 2023, 2:46pm

Hi,

I’ve observed a significant difference in a logistic regression model with categorical features encoded as dummy variables, depending on whether I use all k dummy columns or drop one of them (k-1). With one column dropped, the chains are much tighter and cleaner, there are no divergences, and all rhat values are 1. With all columns included, there are lots of divergences, the chains are ugly, and rhat ranges from 1.05 to 1.12.

Note: the features are several independent binary features plus one categorical variable that’s dummy encoded.
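
For concreteness, the two encodings being compared can be sketched as below. This is only an illustration: the category labels and the helper `dummy_encode` are hypothetical and are not the actual data or model.

```python
def dummy_encode(values, drop_first=False):
    """One-hot encode a list of category labels.

    Returns (column_names, rows). With drop_first=True the first
    level (in sorted order) is omitted and acts as the reference level.
    """
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical categorical feature with k = 3 levels.
colour = ["red", "green", "blue", "green", "red"]

full_cols, full = dummy_encode(colour)                         # k columns
reduced_cols, reduced = dummy_encode(colour, drop_first=True)  # k-1 columns

# With all k columns, every row sums to exactly 1.
full_row_sums = [sum(row) for row in full]
# With k-1 columns, rows belonging to the dropped level are all zeros.
reduced_row_sums = [sum(row) for row in reduced]

print(full_cols, full_row_sums)
print(reduced_cols, reduced_row_sums)
```

In the full encoding the dummy columns always add up to a column of ones, so if the model also contains an intercept (an assumption here, since it isn't stated above) those columns are linearly dependent on it. In the k-1 encoding the dropped level is the one whose rows are all zeros.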

I have two questions:

What is the reason for this?

Since I’d prefer to use the better-identified model, how do I estimate the effect of the category that was left out of the categorical variable?

Here are some plots of the traces of the two models.

Model with all dummy features from the categorical variable:

Model with one dummy feature dropped:
