How to best model hierarchies of multiple categorical predictors

Dear Bayesians,

I would like to learn how you would approach the following situation:

I have data about a special type of promotion: customers have been contacted about certain products. I have a lot of categories that classify the customers; for the sake of simplicity, I will pick three here to make my question clear:

  • age groups of customers
  • 2-digit zip codes
  • recency of purchase (last 0-2 years, 2-5 years, 5+ years)

For each customer, I have information on whether she converted or not.

Now, I would like to fit a model that estimates the influence of these three categories. The aim of the model is to learn how to select customers for future promotions. For example, assume that 1 million customers are available, but we only want to select the best 100k customers based on these categories.

I could now create a very simple model like this:

"conversion ~ (1|age_group) + (1|zip_code) + (1|recency) ", family="bernoulli", ...

However, some of these “groups” would have really small sample sizes, especially some zip codes. Also, age_group and recency seem to be much more important features.

I would therefore tend to create a hierarchy in which the zip code influence is only estimated in relation to age_group and recency. That could look like this:

"conversion ~ (1|age_group) + (1|recency) + (1|age_group:zip_code) + (1|recency:zip_code)", family="bernoulli", ...

However, this leaves open the relationship between age group and recency. These could also be in a hierarchy, and the question is whether I should choose (1|age_group:recency) or (1|recency:age_group) here.

Now, this is only an example with 3 categories. You can imagine that this problem explodes when I have 10 categories. Which one depends on which?

The question now is how you would approach such a modeling situation. Would you start with a simple model without any conditional variables and then introduce hierarchical relations by trial and error? Given that I have a lot of data (millions of rows), fitting each model takes quite some time, so I am looking for ways to optimize this “trial and error” approach.

I would be interested in how you would go through the modeling process in such a situation.

Best regards
Matthias

I don’t understand the logic here.

Starting simple (even when you know exactly what your final model should look like) is always a good recommendation.

If you don’t really understand how your data “works” and need to try out lots of different models, I would suggest working with a reduced subset of your actual data. That will allow you to iterate quickly. You can always test models on the full data set (or a larger subset) if they seem promising on the small subset.
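
For example, something like this with pandas (just a sketch; the fraction and random_state are arbitrary, and df stands for your full DataFrame):

```python
# Random subsample for fast iteration on model structure.
df_small = df.sample(frac=0.02, random_state=42)

# Or sample within each zip code so the group composition stays roughly intact.
df_small = (
    df.groupby("zip_code", group_keys=False)
      .apply(lambda g: g.sample(frac=0.02, random_state=42))
)
```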

I guess this is due to my lack of understanding of mixed-effects models. Yesterday I did quite a bit of research on the differences between fixed effects (“conversion ~ age_group”), adding random effects (“conversion ~ age_group + (1|zip_code)”), and interactions like “conversion ~ age_group + (1|zip_code) + (1|age_group:zip_code)”.

I think my question here was how to deal with the situation where I only have limited data for some age_group/zip_code combinations; in these situations, the model should fall back on the fixed effect of age_group. As far as I understand, this is best expressed as

"conversion ~ age_group + (1|zip_code)")

I also understood that the question of whether to treat a variable as a fixed or random effect is sometimes quite tricky, and the same applies to the question of whether I should model the effects as independent or as interactions. This is probably mostly guided by an understanding of the data and by a workflow “from simple to complex” with several model comparisons along the way.

Best regards
Matthias

If you have very few data points for some zip codes, incorporating an interaction between zip code and age group, instead of including them additively, will only create sparser groups (i.e. groups with even fewer data points). I think the situation you describe, very few data points for some zip codes, supports the usage of a random effect (or partially pooled effect) for zip code. That is what you do with (1 | zip_code).
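
A quick way to check how sparse those interaction cells would be (a sketch assuming a DataFrame df with those columns; the threshold of 30 is an arbitrary choice):

```python
# Number of observations per age_group x zip_code cell.
cell_counts = df.groupby(["age_group", "zip_code"]).size()

print(cell_counts.describe())      # distribution of cell sizes
print((cell_counts < 30).mean())   # share of cells with fewer than 30 rows
```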

On top of that, I subscribe to what Christian said regarding how to work with this data. Start with a simple model, grow it step by step, and use a subset of the data if the data size is too large. If you’re in the regime of millions of rows and you want a hierarchical model with several grouping variables, I think Bambi will be very slow. If all you have is categorical predictors (i.e. grouping variables), you can use a Binomial family after computing the number of successes and trials per group. That should make things faster, as there will be far fewer data points. Or you can try PyMC, which won’t be faster per se, but will give you more control over implementation details.
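
A sketch of that aggregation step, using the same hypothetical df and Bambi’s p(successes, trials) syntax for Binomial outcomes:

```python
import bambi as bmb

# Collapse the Bernoulli rows into successes/trials per unique combination
# of the categorical predictors (assumes a 0/1 "conversion" column).
agg = (
    df.groupby(["age_group", "zip_code", "recency"], as_index=False)
      .agg(successes=("conversion", "sum"), trials=("conversion", "size"))
)

model = bmb.Model(
    "p(successes, trials) ~ (1|age_group) + (1|zip_code) + (1|recency)",
    agg,
    family="binomial",
)
idata = model.fit()
```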

In these cases, it really helps to (a) use partial pooling, (b) use sensible priors, and (c) make sure everything is identified.

For example, for zip codes, you probably want to build a hierarchical ICAR prior to smooth adjacent zip codes.
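
A rough sketch of that idea in PyMC, using the pairwise-difference form of the ICAR density (node1 and node2 are placeholder arrays of adjacent zip-code indices that you would have to build from geographic data beforehand):

```python
import numpy as np
import pymc as pm

n_zip = 4                      # number of distinct zip codes (placeholder)
node1 = np.array([0, 1, 2])    # placeholder adjacency pairs
node2 = np.array([1, 2, 3])

with pm.Model() as spatial_model:
    # The ICAR density is defined entirely through the potentials below,
    # so the base variable gets a flat prior.
    phi = pm.Flat("phi", shape=n_zip)
    # Pairwise-difference form: adjacent zip codes are pulled toward each other.
    pm.Potential("icar", -0.5 * pm.math.sum((phi[node1] - phi[node2]) ** 2))
    # Soft sum-to-zero constraint to keep phi identified next to an intercept.
    pm.Potential(
        "soft_zero_sum",
        pm.logp(pm.Normal.dist(0.0, 0.001 * n_zip), pm.math.sum(phi)),
    )
```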

For the other discrete groupings, you need to either pin one of the values to 0 or enforce a sum-to-zero constraint (I think the latter is harder than the former in PyMC, but it has the advantage of being symmetric in the prior).
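
For illustration, both options might look roughly like this in PyMC (a sketch with a made-up number of levels; ZeroSumNormal is one way to get the symmetric sum-to-zero version):

```python
import pymc as pm
import pytensor.tensor as pt

k = 5  # number of levels of the categorical predictor (placeholder)

with pm.Model() as m:
    # Option 1: pin the first level to 0 and estimate the remaining k - 1 levels.
    raw = pm.Normal("raw", mu=0, sigma=1, shape=k - 1)
    effect_pinned = pm.Deterministic(
        "effect_pinned", pt.concatenate([pt.zeros(1), raw])
    )

    # Option 2: symmetric prior with a built-in sum-to-zero constraint.
    effect_sum_zero = pm.ZeroSumNormal("effect_sum_zero", sigma=1, shape=k)
```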

+1 to this advice.

And no, you don’t want to use fixed effects for groups unless they’re ordinal at the very least.

The interactions get even harder to fit as the data for them gets sparser.

It’s usually pretty simple if you think about what the covariates mean. If you introduce a fixed effect, you assume the effect is linear in the value. This usually doesn’t make sense if the categories are not even ordered, like “vanilla, chocolate, strawberry”. The one case where it’s confusing is age, where you might consider a fixed effect on the index of an age group (say 1–5). But if the response can be non-linear (e.g., popular with middle-aged people, unpopular with older or younger people), then a fixed effect is not appropriate. In general, the random effect is strictly more expressive if there are only finitely many values.
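
To make the distinction concrete, the two choices as Bambi formulas might look like this (a sketch; age_index is an assumed numeric coding of the age groups and df is the hypothetical data frame from earlier):

```python
import bambi as bmb

# Treats age as a numeric covariate: the effect is forced to be linear
# in the group index (1-5).
linear = bmb.Model("conversion ~ age_index", df, family="bernoulli")

# Treats age group as a grouping factor with a partially pooled intercept:
# each group gets its own effect, so non-monotone patterns are possible.
varying = bmb.Model("conversion ~ (1|age_group)", df, family="bernoulli")
```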

Your example introduces a fixed effect for age group, but you are probably going to be better off with 1 | age_group instead, for the reasons mentioned above. Either way, you are not prioritizing either covariate. On the other hand, regularization through partial pooling does the right thing: it pulls estimates toward the population average when there isn’t much data for a group.