Mean centering of time-varying predictors in Bayesian hierarchical models

This is a general question about using mean centering to handle time-varying predictors in a hierarchical Bayesian model for longitudinal data. This is NOT a question about reparameterizing a model using the ‘centering’ trick to deal with funnels in joint distributions.

To motivate this question, imagine a dataset where a group of people rated their sleep quality and recorded the number of alcoholic drinks they had the day before, every day for 30 days. I want to estimate the relationship between alcohol consumption and sleep quality over time, and I land on a multilevel model with random intercepts and slopes, where time points are at level 1 and people are at level 2.

One recommendation for dealing with a time-varying predictor like alcohol consumption is to create two versions of it using mean centering: one version is person-centered, where you subtract a person’s mean alcohol consumption from each of their values (X_{it} - \bar{X}_i); the other is grand-mean-centered, where you subtract the grand mean of alcohol consumption from the person’s mean (\bar{X}_i - \bar{X}). You would then enter the person-centered version as a level 1 predictor and use the grand-mean-centered version to explain random intercept and slope variance. Person-centering isolates the within-person variation, while grand-mean centering isolates the between-person variation. If instead you entered alcohol consumption into the model as a level 1 fixed effect without mean centering, the resulting estimate would capture a mixture of within-person and between-person variance. This has been referred to as a conflated effect (Preacher, Zhang, & Zyphur, 2011) or a smushed effect (Hoffman, 2015).
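For concreteness, here is a minimal sketch of how the two centered versions are typically computed on long-format data (the column names and toy values are just illustrative):

```python
import numpy as np
import pandas as pd

# Toy long-format data: 3 people x 4 days (values are made up)
df = pd.DataFrame({
    "person": np.repeat([1, 2, 3], 4),
    "drinks": [0, 2, 1, 1, 4, 5, 3, 4, 2, 2, 2, 2],
})

person_mean = df.groupby("person")["drinks"].transform("mean")  # X_bar_i
grand_mean = df["drinks"].mean()                                # grand mean

df["drinks_pc"] = df["drinks"] - person_mean   # person-centered (level 1 predictor)
df["drinks_gmc"] = person_mean - grand_mean    # grand-mean-centered (level 2 predictor)
```

By construction, the person-centered values sum to zero within each person (pure within-person variation), while the grand-mean-centered values are constant within a person (pure between-person variation).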

A lot has been written about mean centering in the MLM literature, and there is considerable debate about when and how to use it for substantive reasons beyond just making the intercept interpretable. However, I’ve not seen any discussion of this topic in PyMC materials/examples or in the larger Bayesian literature on MLMs for longitudinal data. @drbenvincent recently mentioned this topic in his wonderful tutorial on moderation analysis, but the central issue there is slightly different.

So, my question is: Why isn’t mean centering discussed in relation to Bayesian MLMs for longitudinal data? Is it because mean centering isn’t really needed in a Bayesian MLM? Or, is it just an artifact of the way people think about and approach MLMs in Bayesian stats?

I would be really interested to hear others’ perspectives on this topic and why it doesn’t get much attention in the Bayesian world. I’m also interested to hear if, in fact, mean centering isn’t really necessary in a Bayesian model of the sort described above, and why that is.


I’m not an expert in mean centering, but I can give you my take on why I usually don’t mean center the data itself.

Bayesian models are intimately linked to an assumed data-generating process. In multilevel models (MLMs), an assumption that is rarely stated explicitly but is almost always implicitly present is that the members of a given level are exchangeable with each other. Take your example: when you say that you have an MLM with subjects, you are assuming that subject 1 and subject 2 are indistinguishable a priori, and that their parameters come from a common distribution. This holds for every subject in the observed dataset, and it is also assumed to hold for subjects outside of it. At the core of this assumption is De Finetti’s theorem. If you want to read more about exchangeability, I recommend Michael Betancourt’s beautiful case study.

When you center X, your model now works on a dataset where the data-generating procedure for X no longer adheres to exchangeability. This is easy to show, because

\tilde{X}_i = X_i - \frac{1}{N}\sum_{j=1}^{N} X_j = X_i - \bar{X}

So, if you assume a priori that the X_i are i.i.d. with common variance \sigma^2, then

\operatorname{Cov}(\tilde{X}_i, \tilde{X}_j) = \sigma^2\left(\delta_{ij} - \frac{1}{N}\right), \qquad \operatorname{Corr}(\tilde{X}_i, \tilde{X}_j) = -\frac{1}{N-1} \quad \text{for } i \neq j

So every centered \tilde{X}_i is now correlated with every other subject’s centered value. As a consequence, some of the parameters of an MLM written for a centered dataset will not be equivalent to their counterparts in the uncentered variant, which makes it harder to make predictions on held-out subjects or to make statements about predicted effect sizes.
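If it helps, a quick numpy simulation (N and seed arbitrary) shows the dependence that centering induces among i.i.d. draws; the off-diagonal correlation comes out at about −1/(N−1), which approaches −1/N as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                # subjects
draws = 200_000      # simulated replications of the dataset

X = rng.standard_normal((draws, N))            # i.i.d. across subjects
X_centered = X - X.mean(axis=1, keepdims=True) # subtract the sample mean

# Correlation between subject 1's and subject 2's centered values
r = np.corrcoef(X_centered[:, 0], X_centered[:, 1])[0, 1]
print(r)  # close to -1/(N-1) = -0.25
```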

That being said, you can write down an MLM that works on the uncentered data but includes both a grand-mean intercept and subject-level intercepts to be learnt. This way you can attempt to distinguish within-person and between-person variation and parameters, while still being able to make predictions on held-out subjects. Writing the model down isn’t complicated; for example, you could do something like:

\alpha_{gm} \sim \operatorname{Normal}(0,1)\\
\alpha_{i} \sim \operatorname{Normal}(0,1)\\
\alpha_{i,t} \sim \operatorname{Normal}(0,1)\\
\sigma_{gm} \sim \operatorname{HalfNormal}(1)\\
\sigma_{i} \sim \operatorname{HalfNormal}(1)\\
\beta_{i, t} = \alpha_{gm} + \sigma_{gm}\alpha_{i} + \sigma_{i}\alpha_{i,t}

where the between-subject variation is learnt by \alpha_{i} and \sigma_{gm}, and the within-subject variation by \alpha_{i,t} and \sigma_{i}. The real problem is that a model like this introduces many degeneracies: combinations of the \alpha's and \sigma's that produce exactly the same \beta_{i,t}. The combination of the priors and the observed data will determine whether the model converges or not. If you are interested in reading more on this, I’ll point you to yet another of Michael Betancourt’s case studies. This is all related to the conflated or smushed effect you mentioned, because there will be correlations across the parameters of different levels.
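To make the degeneracy concrete, here is a small numpy sketch of the deterministic part of the model above (parameter values arbitrary): shifting \alpha_{gm} by any constant c and absorbing −c/\sigma_{gm} into the subject offsets produces exactly the same \beta_{i,t}.

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, n_times = 4, 6

alpha_gm = 0.5
sigma_gm, sigma_i = 1.2, 0.7
alpha_i = rng.standard_normal(n_subjects)              # per-subject offsets
alpha_it = rng.standard_normal((n_subjects, n_times))  # per-observation offsets

def beta(alpha_gm, alpha_i, alpha_it):
    return alpha_gm + sigma_gm * alpha_i[:, None] + sigma_i * alpha_it

b1 = beta(alpha_gm, alpha_i, alpha_it)

# Shift the grand mean by c and absorb it into the subject offsets:
c = 3.0
b2 = beta(alpha_gm + c, alpha_i - c / sigma_gm, alpha_it)

print(np.allclose(b1, b2))  # True: different parameters, identical betas
```

Nothing in the likelihood can tell these two parameter settings apart; only the priors (and how much data you have) constrain where along this ridge the posterior concentrates.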

There are cases in which you cannot bypass the degeneracies, and you have to resign yourself to not being able to accurately estimate within-subject or between-subject variation parameters that adhere to the assumed exchangeability. In those cases, the newly introduced ZeroSumNormal distribution shines. Keep in mind that some of the learnt parameters in those situations will not generalize to held-out subjects, because the exchangeability assumption is broken. I don’t really know how other people deal with that, though.


I think the reply by @lucianopaz is very thorough, so I’m not adding anything new as such, but here are some thoughts:

  1. I prefer to not pre-process data and favour explicitly modelling the data generating process.
  2. Mean centering (and not inferring a grand mean) can be thought of as removing degrees of freedom, therefore making parameter estimation easier by reducing parameter correlations in the posterior.
  3. Mean centering data could be thought of as similar to modelling a grand mean and individual-level deflections, but where there is zero uncertainty about what the grand mean actually is (i.e. it is defined empirically by the observed training set data). This assumption is then ‘messed up’ if you start to consider novel test set data. But the way @lucianopaz describes this is more elegant.

I don’t know if we’ve fully answered your question in terms of time-varying predictors, but I’d say that the line \alpha_{i,t} \sim \text{Normal}(0,1) could be ‘improved’ by specifically incorporating prior knowledge about how the parameter changes over time. And by ‘improved’ I mean that imposing more of your prior beliefs on the model may help with parameter unidentifiability/correlation issues.
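For instance (a sketch, with a hypothetical step scale), one could replace the i.i.d. \alpha_{i,t} \sim \text{Normal}(0,1) with a Gaussian random walk, encoding the belief that a subject’s parameter drifts smoothly from day to day rather than jumping arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(7)
n_times = 30
step_scale = 0.1  # hypothetical: how far the parameter can drift per day

# i.i.d. prior: each day's value is drawn independently
alpha_iid = rng.standard_normal(n_times)

# Random-walk prior: cumulative sum of small day-to-day innovations
alpha_rw = np.cumsum(step_scale * rng.standard_normal(n_times))

# Day-to-day changes are small under the random walk,
# but can be large under the i.i.d. prior.
print(np.std(np.diff(alpha_rw)), np.std(np.diff(alpha_iid)))
```

In PyMC this kind of prior is available as pm.GaussianRandomWalk; the point is just that temporal structure in the prior adds information that can break up degenerate parameter combinations.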

PS. We hope to put out some extensive information about the ZeroSumNormal soon.


Thanks for the thoughtful and detailed responses! Some of the points make sense, like learning the means within the modeling framework, but I’m not totally sure I follow the entire logic. In a nutshell, it sounds like you are saying that mean centering would induce a correlation between the observation units (people tracking their sleep and alcohol intake, in the current example) and violate the assumption that these units are i.i.d. This would then produce biased parameter estimates that would not generalize, undermining the ability to make out-of-sample predictions. Is that a reasonable summary of your explanation?

The idea that centering on the grand mean would create a relationship between units and undermine the i.i.d. assumption makes sense intuitively. What’s interesting is that in the literature I’m drawing on, the recommendation is merely to subtract some constant from the person mean, \bar{X}_i. The grand mean is usually used for convenience, to improve the interpretation of the results, but you could use any number. For example, in the alcohol-sleep study, you might want to use 1, the minimum number of drinks someone must consume to be considered a ‘regular’ drinker in the epidemiology literature. If my understanding is correct, I don’t see how ‘centering’ would be a problem here, given that this number is not inherently linked to the observation units.
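For what it’s worth, that intuition can be checked in simulation: subtracting a fixed constant shifts every value but leaves i.i.d. draws uncorrelated across units, whereas subtracting the sample grand mean induces a negative correlation between them (numpy sketch, values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, draws = 5, 200_000
X = rng.standard_normal((draws, N))  # i.i.d. across subjects

X_const = X - 1.0                           # center on a fixed constant
X_mean = X - X.mean(axis=1, keepdims=True)  # center on the sample mean

r_const = np.corrcoef(X_const[:, 0], X_const[:, 1])[0, 1]
r_mean = np.corrcoef(X_mean[:, 0], X_mean[:, 1])[0, 1]
print(r_const, r_mean)  # ~0 vs. roughly -1/(N-1)
```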

The other thing I’ll note is that in social-behavioral fields like education, psychology and sociology, MLMs are used exclusively for inference about population parameters as a means of answering substantive scientific questions. Prediction is never the goal. This might be part of the reason why the issues you raise aren’t considered important or problematic in the non-Bayesian world. You note that parameters in the centered model will not be equivalent to the non-centered version, but that’s generally the goal. In other words, the parameters that would be most useful for prediction are not necessarily the most useful for answering the core research question. I’ve encountered this tension at various times and it rings true; sometimes the ideal model for prediction is not the ideal model for inference.

This is all really interesting. Lots of food for thought!