How to model/handle hierarchical features in logistic regression

dorian821 · September 25, 2023, 10:19am

I’m working on a logistic regression model that uses two binary features that have a hierarchical relationship to each other. More specifically, one feature is a general case of the other, so the specific feature can only be 1 when the general feature is also 1. Consequently, a crosstab of the features looks like this:

feature_a_general      0    1
feature_a_specific
0.0                 3021  459
1.0                    0  275

This relationship naturally creates multicollinearity between these two variables, and consequently, results in biased coefficients.
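
To get a rough sense of how strong this collinearity is, the correlation between the two binary features (the phi coefficient) can be computed directly from the four cell counts in the crosstab above. This is a minimal sketch in plain Python; the variable names are illustrative, and the variance inflation factor line assumes these are the only two predictors in the model.

```python
import math

# Cell counts from the crosstab, keyed as (feature_a_specific, feature_a_general).
counts = {(0, 0): 3021, (0, 1): 459, (1, 0): 0, (1, 1): 275}

n00 = counts[(0, 0)]
n01 = counts[(0, 1)]  # specific=0, general=1
n10 = counts[(1, 0)]  # specific=1, general=0 (empty because of the hierarchy)
n11 = counts[(1, 1)]

# Marginal totals for each feature.
specific_1 = n10 + n11
specific_0 = n00 + n01
general_1 = n01 + n11
general_0 = n00 + n10

# Phi coefficient: the Pearson correlation between two binary variables.
phi = (n11 * n00 - n10 * n01) / math.sqrt(
    specific_1 * specific_0 * general_1 * general_0
)

# With exactly two predictors, VIF = 1 / (1 - r^2).
# Assumption: no other covariates are in the model.
vif = 1.0 / (1.0 - phi ** 2)

print(f"phi = {phi:.3f}, VIF = {vif:.2f}")
```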

I would like to find a way to model this relationship more explicitly so as to reduce bias in the learned coefficients.

One hacky approach I was considering is “masking” (redacting) the general feature, i.e. setting it to 0 wherever the specific feature equals 1.

This results in a crosstab as below:

feature_a_general_redacted     0    1
feature_a_specific
0.0                         3021  459
1.0                          275    0
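
The masking step can be sketched as follows. The real data are not shown here, so this rebuilds the two columns from the cell counts in the first crosstab as a stand-in, sets the general feature to 0 wherever the specific feature is 1, and tallies the result. The names `specific`, `general` and `general_redacted` are illustrative.

```python
from collections import Counter

# (feature_a_specific, feature_a_general) -> row count, from the first crosstab.
cell_counts = {(0, 0): 3021, (0, 1): 459, (1, 1): 275}

# Expand the counts into one value per row (stand-in for the real columns).
specific, general = [], []
for (s, g), n in cell_counts.items():
    specific.extend([s] * n)
    general.extend([g] * n)

# Redact: zero out the general flag wherever the specific flag is set.
general_redacted = [g * (1 - s) for s, g in zip(specific, general)]

# Tally (feature_a_specific, feature_a_general_redacted) pairs.
redacted_crosstab = Counter(zip(specific, general_redacted))
print(dict(redacted_crosstab))
```

After redaction the two flags are mutually exclusive, so together with the implicit "neither" case they behave like dummy codes for a single three-level category.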

Doing this redaction doesn’t change the estimated coefficient for the general variable at all, but significantly changes it for the more specific variable. This is to be expected.
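
One way to see why this is expected: because the specific feature implies the general one, the redacted column is exactly `general - specific`, so both codings describe the same three groups and differ only in how the coefficients are read. The sketch below works this through for the simplest case, a maximum-likelihood fit with an intercept and only these two features, where the fitted log-odds in each group equal the observed log-odds. The positive-outcome counts are hypothetical, since the outcome variable is not shown in this post. The same linear relationship holds for maximum-likelihood fits with additional covariates, because the two design matrices span the same column space; with priors on the coefficients it is only approximate.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# Group sizes come from the crosstab; the positives per group are HYPOTHETICAL.
groups = {
    "neither": {"n": 3021, "positives": 300},
    "general_only": {"n": 459, "positives": 90},
    "specific": {"n": 275, "positives": 110},
}
log_odds = {k: logit(v["positives"] / v["n"]) for k, v in groups.items()}

# Original coding: general=1 for both "general_only" and "specific".
#   neither      -> b0
#   general_only -> b0 + b_general
#   specific     -> b0 + b_general + b_specific
b0 = log_odds["neither"]
b_general = log_odds["general_only"] - b0
b_specific = log_odds["specific"] - log_odds["general_only"]

# Redacted coding: general=0 whenever specific=1.
#   neither      -> b0
#   general_only -> b0 + b_general_redacted
#   specific     -> b0 + b_specific_redacted
b_general_redacted = log_odds["general_only"] - b0
b_specific_redacted = log_odds["specific"] - b0

print(f"general:  {b_general:.3f} (original) vs {b_general_redacted:.3f} (redacted)")
print(f"specific: {b_specific:.3f} (original) vs {b_specific_redacted:.3f} (redacted)")
```

Under the original coding the specific coefficient is the contrast "specific vs. general-only"; under the redacted coding it becomes "specific vs. neither", which is the sum of the two original coefficients.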

Here are the distributions of the learned coefficients for both approaches:

I would like to better understand:

1. whether this is a valid approach;
2. how I could do this in a more principled way, for example by modelling the dependency between the features explicitly.
