Preferred sampler for categorical predictors?

localh · April 13, 2021, 2:16am

I have a linear model whose predictors are all categorical variables converted to dummy variables, and I am wondering if someone has a recommended sampler. I have been avoiding NUTS and playing with Metropolis and HamiltonianMC; however, there are many to pick from, so I would welcome any advice for this general setup.

cluhmann · April 13, 2021, 2:39am

Is there some specific reason you are avoiding NUTS? Models with dummy-coded predictors (typically) still have continuous parameters (coefficients), so NUTS would be my default (and is probably selected by pm.sample() automatically). But if you have some reason to look elsewhere, knowing what that reason is might help point you in the right direction.

localh · April 13, 2021, 2:44am

Ah, I was under the impression that NUTS does not play well with categorical data and thus should be avoided!

cluhmann · April 13, 2021, 3:02am

NUTS does not play well with categorical parameters, but is otherwise fine with categorical data. So if your data is dummy-coded and you have continuous coefficients, you should be fine. In general, pm.sample() will automatically try to select a reasonable sampling scheme by inspecting your model. NUTS tends to be much better than the alternative MCMC algorithms when it’s available.
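
To make the distinction concrete, here is a minimal sketch in plain Python (not PyMC). The toy data, the reference level "a", the fixed sigma of 1, and the helper names are all made up for illustration. It shows that the dummy-coded data are 0/1, but the log-likelihood is still a smooth function of the continuous coefficients, so the gradients that NUTS relies on exist:

```python
import math

# Hypothetical toy data: one categorical predictor with three levels.
x = ["a", "b", "c", "a", "c", "b"]
y = [1.0, 2.1, 2.9, 0.8, 3.2, 1.9]

# Dummy-code with "a" as the reference level: columns are intercept, is_b, is_c.
X = [[1.0, float(v == "b"), float(v == "c")] for v in x]


def log_lik(beta, sigma=1.0):
    """Gaussian log-likelihood as a function of the continuous coefficients."""
    total = 0.0
    for row, yi in zip(X, y):
        mu = sum(b * xij for b, xij in zip(beta, row))
        total += -0.5 * math.log(2 * math.pi * sigma**2)
        total += -0.5 * ((yi - mu) / sigma) ** 2
    return total


def grad_log_lik(beta, sigma=1.0):
    """Analytic gradient: sum over rows of (y_i - mu_i) * x_ij / sigma^2."""
    grad = [0.0] * len(beta)
    for row, yi in zip(X, y):
        mu = sum(b * xij for b, xij in zip(beta, row))
        for j, xij in enumerate(row):
            grad[j] += (yi - mu) * xij / sigma**2
    return grad


def finite_diff_grad(beta, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for j in range(len(beta)):
        up = list(beta)
        down = list(beta)
        up[j] += eps
        down[j] -= eps
        grad.append((log_lik(up) - log_lik(down)) / (2 * eps))
    return grad


beta = [1.0, 0.5, 1.5]  # arbitrary point in coefficient space
analytic = grad_log_lik(beta)
numeric = finite_diff_grad(beta)
```

The two gradients agree, which is the property a gradient-based sampler needs. A categorical parameter, by contrast, has no such gradient.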

localh · April 13, 2021, 3:10am

Thanks so much!

jonsedar · April 13, 2021, 9:55am

Just a sidenote from experience: if you’ve a lot of categorical features and/or features with many levels, leading to many linear parameters, you’ll probably be well served by introducing partial pooling, e.g. as in the “GLM: Hierarchical Linear Regression” example in the PyMC3 3.11.2 documentation.
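
As a rough illustration of what partial pooling does to sparse levels, here is a small sketch under simplifying assumptions: a normal-normal model in which the global mean, the within-level noise sigma, and the between-level spread tau are treated as known. A real hierarchical model would estimate these, and the function name and numbers below are hypothetical:

```python
def partial_pool_mean(level_values, global_mean, sigma, tau):
    """Posterior mean of one level's effect under a N(global_mean, tau^2) prior.

    Assumes observations within the level are N(effect, sigma^2) with
    sigma and tau known, so the result is a precision-weighted average.
    """
    n = len(level_values)
    level_mean = sum(level_values) / n
    data_precision = n / sigma**2
    prior_precision = 1.0 / tau**2
    weighted = data_precision * level_mean + prior_precision * global_mean
    return weighted / (data_precision + prior_precision)


# Two hypothetical levels with the same raw mean but very different counts.
common_level = [3.0] * 20
rare_level = [3.0]

common_estimate = partial_pool_mean(common_level, global_mean=0.0, sigma=1.0, tau=1.0)
rare_estimate = partial_pool_mean(rare_level, global_mean=0.0, sigma=1.0, tau=1.0)
```

The rare level is pulled much further toward the global mean than the common one, which is the regularizing effect you want when many levels have few observations.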

Lime · April 13, 2021, 5:14pm

If I had missing categorical predictors or categorical features, which sampler would be best for that?

cluhmann · April 13, 2021, 5:31pm

Ultimately, the best sampling scheme depends on what your model is. Just knowing what your data is like doesn’t give you any strong hints about how best to sample. So if you have a specific model you have questions about, feel free to post it and someone can weigh in on how best to proceed.

jonsedar · April 14, 2021, 2:24am

I’m not sure the sampler matters so much as the model construction, and then your sampler would simply be a consequence of that.

Handling missing values in a categorical is an interesting problem. AFAIK under the current missing value imputation, the features are imputed independently, so you probably wouldn’t want to try to impute a {0,1} for each column in a one-hot-encoded set of columns, because you’d also have to constrain each row to sum to 1. (Though I suppose you could do that with a switch and Potential)

If you’re partial-pooling the categorical factors, then you have an indexing feature, so perhaps you could try to impute that, requiring that it comes from a discrete uniform distribution {0, …, max_index_value}. Naturally you’d have to use a discrete-friendly sampler for that particular feature.
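
A quick sketch of why the index representation is easier to impute than the one-hot columns, in plain Python with a hypothetical number of levels K and a made-up helper one_hot:

```python
from itertools import product

K = 4  # hypothetical number of levels in the categorical feature

# Imputing each one-hot column independently as {0, 1} allows 2**K rows,
# but only the rows that sum to 1 describe a valid category.
all_rows = list(product([0, 1], repeat=K))
valid_rows = [row for row in all_rows if sum(row) == 1]


def one_hot(index, k=K):
    """Expand a category index in {0, ..., k - 1} to its one-hot row."""
    return tuple(1 if j == index else 0 for j in range(k))


# Every value of a single index in {0, ..., K - 1} maps to a valid row,
# so no extra sum-to-1 constraint is needed.
rows_from_index = [one_hot(i) for i in range(K)]
```

With K = 4, only 4 of the 16 independently imputed rows are valid, whereas every imputed index is valid by construction.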

Mcb · April 14, 2021, 12:04pm

Marginalizing discrete variables is a nice solution to avoid subpar samplers. After marginalizing, you can always recover the discrete variables by sampling them from the marginal probabilities.
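
A minimal sketch of the idea for a single observation whose category is missing, in plain Python. It assumes a Gaussian likelihood with known sigma, and that level_means stands in for the per-level coefficients at one posterior draw; all names and numbers are hypothetical:

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_loglik(y, level_means, level_probs, sigma):
    """Sum the missing category out: log sum_k p_k * N(y | mean_k, sigma^2)."""
    terms = [
        math.log(p) + normal_logpdf(y, mu, sigma)
        for mu, p in zip(level_means, level_probs)
    ]
    return logsumexp(terms), terms


def level_probabilities(y, level_means, level_probs, sigma):
    """Probability of each category given y and the continuous parameters."""
    total, terms = marginal_loglik(y, level_means, level_probs, sigma)
    return [math.exp(t - total) for t in terms]


level_means = [0.0, 2.0, 4.0]       # hypothetical per-level coefficients
level_probs = [1 / 3, 1 / 3, 1 / 3]  # prior over the missing category
y_obs = 1.9

loglik, _ = marginal_loglik(y_obs, level_means, level_probs, sigma=1.0)
probs = level_probabilities(y_obs, level_means, level_probs, sigma=1.0)
```

The marginal log-likelihood depends only on continuous quantities, so a gradient-based sampler can handle it, and the missing category can be drawn afterwards from probs for each posterior draw.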

jonsedar · April 14, 2021, 3:06pm

Yeah, cool - I hadn’t thought of it that way. Imputing missing values in a categorical feature is a bit like (I think) assigning a latent single-member cluster label, and the non-missing values act to seed the cluster labels. Maybe…
