How to model a latent variable when individual-level data is unknown but group-level data is known

I’m interested in understanding why certain users make a purchase on my website while others do not. I’m using logistic regression with a binary outcome (whether a user has ever purchased or not). I have various user-level attributes, like how often they log in. But a variable I’d like to include in the model is something like “household income”, which I don’t have access to. I am wondering if/how one would include household income as a latent variable in the model, given publicly available data like GDP per capita at the country level.
So my data might look like:

import pandas as pd

pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6, 7],
    'country_id': [1, 1, 1, 2, 2, 1, 2],
    'gdp_per_capita': [50000, 50000, 50000, 20000, 20000, 50000, 20000],
    'logins_last_90_days': [24, 58, 4, 15, 64, 25, 10],
    'purchased': [0, 1, 0, 0, 1, 1, 0],
})

Does this ask make sense? Would one gain anything by attempting to model the problem like that, or would it make more sense to just use the gdp/capita value as an estimate for a user’s income?

Thanks in advance.

As it stands now it won’t work, because country_id is perfectly correlated with gdp_per_capita. That is, if you tell me what country_id is, I will know with 100% accuracy what gdp_per_capita will be. Usually if you want to introduce a variable like this you introduce a time dimension, where GDP will vary but country ID will not.

Assuming you had some time variation, the macroeconomic quantity you are actually interested in is per-capita household consumption, not GDP (which also includes investment and government spending). At the very simplest, people write consumption equations like: C = \Omega + bY

Where C is consumption, Y is GDP, and \Omega is household wealth. So you could add a second likelihood function to your model to estimate \Omega, then include this latent variable in your data for the logistic regression.
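To make the two-likelihood idea concrete, here is a minimal numpy sketch of the first stage, assuming you had a yearly panel of GDP and consumption per country (every number below is invented for illustration): estimate a per-country \Omega and a shared slope b from C = \Omega + bY. In a full Bayesian model you would put this consumption likelihood and the purchase likelihood inside the same model, so the uncertainty in \Omega propagates into the logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical panel: 2 countries observed for 20 years each. The GDP paths
# and the "true" parameters below are invented for illustration.
n_years = 20
true_omega = np.array([15_000.0, 4_000.0])  # per-country household wealth
true_b = 0.6                                # marginal propensity to consume

country = np.repeat([0, 1], n_years)
t = np.tile(np.arange(n_years), 2)
base = np.where(country == 0, 50_000.0, 20_000.0)
Y = base * (1 + 0.02 * t) + rng.normal(0, 500, size=2 * n_years)  # GDP path
C = true_omega[country] + true_b * Y + rng.normal(0, 300, size=2 * n_years)

# Joint least squares: one intercept (Omega_j) per country, one shared slope b.
X = np.column_stack([country == 0, country == 1, Y]).astype(float)
coef, *_ = np.linalg.lstsq(X, C, rcond=None)
omega_hat, b_hat = coef[:2], coef[2]
```

The time variation in Y within each country is what lets the per-country intercepts be separated from the slope; with a single year of data, this design matrix would be rank-deficient.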

Note that again, without time variation, the estimated \Omega would be perfectly correlated with country ID (since every country would only get one estimate), and it would be “absorbed by the fixed effect” (to use the econometric jargon).


Thanks for the quick and thoughtful answer @jessegrabowski! That makes a lot of sense.

Related/more general question: How would one include country-level information into a logistic regression like this with a hierarchical structure, if there was no time component? Intuitively it seems like it would be valuable to know that country 1 has higher gdp/capita than country 2, and that would influence whether or not a user might convert. But not sure how to do that given that country is perfectly correlated to gdp/capita.

Sorry for replying to this slowly. I took a crack at it a couple times, but I never really thought I hit the mark. I will try again, and I hope others can chime in if I do a poor job.

When you study variation between units with a linear model, you are holding certain aspects of the units fixed, and comparing the variation in the remaining attributes. For example, when you do a regression like \text{Height}_i = \beta_0 + \beta_1 \cdot \text{Female}_i + \beta_2 \cdot \text{Weight}_i + \epsilon_i, you are asking, “what variation in height exists between females of the same weight, given the sample average height?” The fact that variation exists in this dimension allows the parameters \beta_0, \beta_1, \beta_2 to be identified.

So the key idea to all these models, whether it’s boring frequentist OLS with linear algebra or exciting hierarchical models with MCMC, is “what is the remaining variation in the data given a list of factors held constant”.

This should now answer your question about GDP and country. If you have 3 countries numbered 1, 2, 3, and write a model \text{GDP}_i = \alpha_0 + \sum_{j=1}^{3} \alpha_j \cdot \left [ \text{Country}_i = j \right ] + \epsilon_i, I hope you can see that there is no remaining variation in the data from which to measure \epsilon_i (each country’s GDP is fit exactly), and the intercept is a perfect linear combination of the three country dummies, so the model is not identified.
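You can check this numerically: with an intercept plus one dummy per country, the design matrix is rank-deficient, and appending a GDP column (constant within country) adds no rank either. A small sketch:

```python
import numpy as np

# 3 countries, 5 observations each.
country = np.repeat([0, 1, 2], 5)
dummies = (country[:, None] == np.arange(3)).astype(float)

# Intercept + three dummies: the dummies sum to the intercept column,
# so the four alphas cannot all be separately identified.
X = np.column_stack([np.ones(15), dummies])
print(np.linalg.matrix_rank(X))  # 3, not 4

# A per-country GDP column is itself a linear combination of the dummies,
# so it contributes nothing new: the rank stays 3.
gdp = np.array([50_000.0, 20_000.0, 30_000.0])
X2 = np.column_stack([X, gdp[country]])
print(np.linalg.matrix_rank(X2))  # still 3
```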

So let’s connect this to your actual problem. From each of these three countries, you have 10 customers, each of whom either purchased or didn’t. You include the “country fixed effects”, or hierarchical intercepts in the Bayesian jargon, and estimate a model:

\text{Purchased}_i = \alpha_0 + \sum_{j\in\mathcal{J}}\alpha_j \cdot \left [ \text{Country}_i = \text{Country}_j \right ] + \beta X_i + \epsilon_i

Where \mathcal J is the set of all countries, and \left [ x = y \right ] is an indicator function that evaluates to 1 if true and 0 otherwise. The key insight is that these varying intercepts in the country dimension capture everything that varies between countries, everything that could sit on the right-hand side of a country-level equation like the GDP one above. Inside these terms is GDP, but also language, history, culture, institutions – everything that makes a country a country is bundled up and given to you in the \alpha_j term.
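To make this concrete, here is a self-contained numpy sketch of that model on simulated data (all numbers are invented): a logistic regression with one intercept per country and a shared coefficient on a user-level covariate, fit by Newton–Raphson. Note that the design drops the global intercept \alpha_0 so that the three country intercepts are identified.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: 3 countries, 1000 users each, one user-level covariate.
n_per = 1000
country = np.repeat([0, 1, 2], n_per)
logins = rng.poisson(20, size=3 * n_per).astype(float)

true_alpha = np.array([-1.0, 0.5, -0.2])  # absorbs GDP, culture, etc.
true_beta = 0.05
p_true = 1 / (1 + np.exp(-(true_alpha[country] + true_beta * logins)))
y = rng.binomial(1, p_true)

# Design: one dummy per country (no global intercept) + the covariate.
X = np.column_stack([(country[:, None] == np.arange(3)).astype(float), logins])

# Logistic MLE via Newton-Raphson (equivalent to IRLS).
w = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ w))
    grad = X.T @ (y - p)
    H = X.T @ (X * (p * (1 - p))[:, None])
    w += np.linalg.solve(H, grad)

alpha_hat, beta_hat = w[:3], w[3]
```

In PyMC you would instead give the \alpha_j a shared hyperprior, turning these fixed effects into partially pooled hierarchical intercepts, but the identification logic is the same.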

The core theoretical idea underpinning all this, which it might be helpful to know (or not), is the Frisch–Waugh–Lovell theorem, which states that multivariate linear regression performs a conditional variance decomposition. The \beta vector from the formula above is not the effect of X_i on \text{Purchased}_i, it is the effect of X_i on \text{Purchased}_i given the \alpha_j s! In essence, it is the same as first running the country regression I presented above on every variable in X_i, then running the Purchased regression on the residuals, omitting the \alpha s (this is the point about the M_x matrix in the wiki article, which I linked despite it being impenetrably dense in typical wiki fashion; my apologies).

So the short answer to your question of how to use GDP in this setup is “you can’t”, but the longer answer is “you don’t need to, because the country-specific intercepts already do it”. If you insisted on the issue, for example because you want to know the causal effect of economic growth on consumer activity, then you either have to turn to some clever pseudo-experimental design like instrumental variables regression, or add a time dimension so you can exploit inter-temporal variation in GDP.

The final approach is to use extremely informative priors. All of the above conversation is mostly grounded in frequentist statistics, which is at the mercy of the cruel mistress of matrix inversion. In Bayes, we have a bit more flexibility with identification. At their core the principles remain the same - you are trying to exploit variation in the residuals to identify fixed sub-spaces of the data - but everything is more “fuzzy”, because you get to inject some additional information via the priors. If you write the regression:

\text{Purchased}_i = \alpha_0 + \sum_{j\in\mathcal{J}}(\alpha_j + \gamma \cdot \text{GDP}_j) \cdot \left [ \text{Country}_i = \text{Country}_j \right ] + \beta X_i + \epsilon_i

Setting a strong prior on \gamma might let you achieve identification and get posterior samples, despite the lack of variation in the data.
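As a sketch of what that looks like (not a full PyMC model; this just finds the MAP with scipy, and all data and prior values are invented): standardize GDP, put tight zero-centered priors on the country offsets and a strong prior on \gamma, and the penalized likelihood has a unique maximum even though \gamma \cdot \text{GDP}_j and the country intercepts are perfectly collinear in the data.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Invented data: 2 countries, GDP constant within country (no time variation).
n_per = 1000
country = np.repeat([0, 1], n_per)
gdp_std = np.array([1.0, -1.0])            # standardized per-country GDP
logins = rng.normal(size=2 * n_per)        # standardized user covariate

true_logit = -0.1 + 0.2 * gdp_std[country] + 0.8 * logins
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

def neg_log_posterior(params):
    a0, a1, a2, gamma, beta = params
    alpha = np.array([a1, a2])
    logit = a0 + alpha[country] + gamma * gdp_std[country] + beta * logins
    loglik = np.sum(y * logit - np.log1p(np.exp(logit)))
    # Without priors, (a1, a2) and gamma * gdp_std are perfectly collinear.
    # Tight N(0, 0.1) priors on the offsets and a strong N(0.2, 0.05) prior
    # on gamma make the posterior mode unique.
    logprior = (-0.5 * np.sum((alpha / 0.1) ** 2)
                - 0.5 * ((gamma - 0.2) / 0.05) ** 2)
    return -(loglik + logprior)

res = minimize(neg_log_posterior, np.zeros(5), method="BFGS")
a0, a1, a2, gamma, beta = res.x
```

Whether this is wise depends entirely on how much you trust the prior on \gamma: the data cannot push back against it, so the posterior for \gamma is essentially the prior restated.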


@jessegrabowski I really appreciate the time you took here to answer this so thoroughly! This was immensely helpful, so thank you.