Cholesky decomposition and correlation among random effects


#1

I was reading the paper Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists. In section 2.3 Varying Intercepts, Varying Slopes Mixed Effects Mode, paragraph Defining a variance-covariance matrix for the random effects they suggest to define a covariance matrix to capture the correlation between the random effects (intercept and slope).

I have been playing around with random effects model (see for example previous discussion on design matrix), and I obtained nice results, without implementing the covariance matrix. The approach I took was to define two separate design matrix for the random effects (1 for intercept, 1 for slope) to have more control on the priors, I parametrized the model in non-centered way (as done in http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/), and then sampled. The correlation between intercept and slope was not modeled directly, but it could be inferred from the posterior (intercept and slope are likely to be correlated).

I then tried to apply the method suggested in the paper, and I obtained similar results. And it seems to me that this approach is basically an alternative way to parametrize the model in non-centered way. That is, we sample from two independent distributions (1 for intercept, 1 for slope) and then we get the correlated parameters estimation with the covariance matrix (computed via the Cholesky decomposition for computational reasons).

Is this true, do I understand correctly?


Correlated slopes in multivariate model
#2

This question was raised before here Covariance of multiple varying coefficients in hierarchical models and I meant to write down some though but never manage to. It is something used to confuse me as well, and now my understanding is quite similar to yours.

Classically, the covariance of the betas/coefficients (regardless it is for the fixed effect or the random effect) usually comes into play when you are computing the contrast or linear combinations among them (eg. ANOVA). This process itself is projecting the contrast to some space using the projection matrix (rotation and scaling etc) derived from the covariance matrix. In Bayesian LME, since the posterior samples already in the right space, we can compute the contrast directly using the sample to have the correct marginal. But if we model the covariance specifically, then we should rotate the posterior when we are computing different contrast.


#3

I do not follow here. Once I have the Cholesky matrix chol and the independent components gamma_Z_raw, I can get the correlation by multiplying them together; something like: gamma_Z = tt.dot(chol, gamma_Z_raw). Then I can compute any contrast I like, right?

On one hand, it seems like more work (and at least in my experiment it takes 3x longer to fit the same model), so the impression that I get is that we should not bother too much. On the other hand, Bayesian packages like STAN and bmrs seems to rely quite much on the Cholesky decomposition and LKJ prior. For example, at page 7 in the vignette of the brms package it reads

By default, group-level coefficients within a grouping factor are assumed to be correlated. Correlation can be set to zero by using the (coefs || group) syntax.

with correlation computed via Cholesky and LKJ prior. So I am wondering if I am missing something out.
I am thinking: can it be that we need to use the correlation matrix when we sample from a multivariate distribution (e.g., MvNormal) but it is unnecessary when we sample from an array of independent distributions (e.g., normal(..., shape=2))? Maybe I am completely off track…


#4

Yes.

If you have normal(..., shape=2) then there is no point to sample from a MvNormal. The default in brms is correct, and you do need to do it when you have more than one random effect within the same grouping, for example, y ~ x1 + (1 + x2 | subjects). In these cases, the shape of the random effect coefficients are (2, nsubjects), which should be sampled from a MvNormal with a 2*2 correlation/cov matrix.
My personal use case is usually random intercept effect (e.g., y ~ x1 + (1 | subjects)), so I dont use this often, but if the MvNormal for random effect is necessary if you have multiple nested random effect and enough groups.


#5

First of all, Thank you for taking time to answer :slight_smile:

I am confused now… but I really want to understand. So bear with me!

In my test case I had exactly the scenario you are talking about y ~ x1 + (1 + x2 | subjects). However, I did not modeled the covariance matrix, but I sampled intercept and slope from an array of distributions, something like:

gamma_intercept = pm.Normal(..., shape=n_subjects)
gamma_slope = pm.Normal(..., shape=n_subjects)

which I then integrated in my likelihood using two design matrices, \mathbf{Z}_{intercept}\gamma_{intercept} + \mathbf{Z}_{slope}\gamma_{slope}
And because, as you said …

In Bayesian LME, since the posterior samples already in the right space, we can compute the contrast directly using the sample to have the correct marginal

… I can perform my analysis (contrasts and whatnot) directly on the posterior samples. Intercept and slope for each participants are, in fact, correlated, without having to model the correlation matrix:
image

From your last answer, however, it seems you are saying that when I have more than one random effects as in this case (i.e., intercept and slope) I should, in fact, model the covariance matrix and sample the intercept and slope from a multivariate normal distribution. What am I missing :thinking: ?


#6

Here in the plot you are only showing the correlation between the intercept and slope of 1 subject/group right? Because if you have 10 subjects/groups you will need 10 such plots to display the correlation between the random intercept and the random slope. So it you model the Correlation/cov matrix explicitly, you can get the overall corr between intercept and slope across subjects (otherwise you need to estimated afterward from the posterior sample, which could be suboptimal). But my previous point remain that if you are computing contrast, with or without corr matrix explicitly model you get similar result.


#7

That’s exactly what I did :wink:

Oooh, I think I get it now. If I were interested in the overall parameter correlation across subjects, and not only on the correlation within subject (as shown in the plot before), then the correlation matrix would help.

I run again my script and I got that the covariance matrix \Sigma = \mathbf{CC}^\rm{T} = \bigl( \begin{smallmatrix} 23.9 & 0.7 \\ 0.7 & 5.8 \end{smallmatrix}\bigr), where \mathbf{C} is the Cholesky matrix. I can then interpret that the the intercept has the highest average variance across subjects (23.9), and the average covariance is 0.7.


#8

I was looking at the results I got from the implementation of the covariance matrix.
The posterior predictive check looks good, suggesting that the model is correct (in blue the empirical data, in gray the posterior generated data):

The covariance matrix is \mathbf{\Sigma} = \bigl( \begin{smallmatrix} 23.8 & 0.82 \\ 0.82 & 5.75 \end{smallmatrix}\bigr) (the values are the average of the samples in the trace). I then extracted the correlation matrix \mathbf{R}=(diag(\Sigma))^{0.5} \Sigma \space (diag(\Sigma))^{0.5} and I obtained a Pearson coefficient of \rho=0.07 — which should be the average expected correlation between intercept and slope across participants.

This value of \rho=0.07, however, is not correct. In fact, the Pearson coefficient of each participant is about -0.5:

I am quite confident the covariance matrix has been computed correctly (from the Cholesky matrix \mathbf{C} as \mathbf{\Sigma=CC^\rm{T}}); same for the correlation matrix. Do you guys spot the error :thinking: ?


#9

Hmmm the computation seems all correct - did you look at the same scatter plot between random intercept and random slope in the model with correlation matrix?


#10

Yes, that’s the graph I attached before. It shows the correlation between intercept and slope for each participant, computed from the correlation matrix. The notebook is here in case you are interested/have time to check :wink:
I am sure I did a silly mistakes somewhere, strange though that the fit looks alright…


#11

Right - in that case it seems the correlation is all “loaded” in the posterior correlation between the sample, which should be captured by the correlation matrix in theory…

[EDIT]: the problem is you are correlating the random effect twice in:

# independent component for each subject and each random effect
    gamma_Z_raw = pm.MvNormal('gamma_Z_raw', mu=np.zeros(n_random_effects), chol=chol, shape=(n_subjects, n_random_effects))
    
    # Compute the correlated random variables via the covariance matrix
    gamma_Z = tt.dot(chol, gamma_Z_raw.T)

#12

I guess you spotted the error! Oh man, I guess I mixed the centered and non-centered parametrization there :sweat:
So either I do

# centered parametrization
gamma_Z = pm.MvNormal('gamma_Z_raw', mu=np.zeros(n_random_effects), chol=chol, shape=(n_subjects, n_random_effects))

or

# non-centered parametrization (see: https://docs.pymc.io/api/distributions/multivariate.html?highlight=mvnormal#pymc3.distributions.multivariate.MvNormal)
gamma_Z_raw = pm.Normal('gamma_Z_raw', mu=0, sd=1, shape=(n_subjects, n_random_effects))
gamma_Z = tt.dot(chol, gamma_Z_raw.T).T

I ran the notebook again. The sampling is way faster, but the correlation is still off (\rho=0.11). Mmmm, maybe the variance in the intercept is quite high, which may suggest that something is off

[ [596.98838465,  16.46747156],
  [ 16.46747156,  36.57017179] ]

What strikes me is that the results seems very good anyway, even with the error before (which is worrying in a way). I guess there is something wrong in the code that the sampler manages to recover anyway…

I guess I need to look at that tomorrow with a fresh mind.


#13

That’s what often happen when you quantifying the performance by some function of y - \hat{y}. I often find that even if the parameter estimation is off the average prediction of the observed is very close. With your error, the covariance matrix kind of divide into two part - with a bit of algebra you might be able to recover it - so you would have the wrong estimation, but similar prediction.


#14

I tried again (here the updated notebook). This thing starts to bug me.
I found a similar example in the Statistical Rethinking PyMC3 port (code 13.12). The difference is that I use the non-centered parametrization (but I also tried the centered parametrization with same results.). Both implementations looks similar.
But I still get an aggregate correlation \rho=0.08 from the covariance matrix which is far away from the value I get by computing the correlation from the samples of intercept and slope in the trace (about \rho=-0.5). I really cannot spot the error, if any!

  1. My understanding is that both \rho values should be sort of the same, around 0.5. Right?
  2. Do you have any suggestions on how to debug this problem? As the posterior check looks good, there is no a striking result which may hint to an error.

#15

As you can see from the Statistical Rethinking PyMC3 port, this approach could estimate the correlation between random effect correctly. So maybe our intuition is not correct - actually, you cannot estimate the correlation between the random effect by looking at the correlation of their posterior samples.

What remain to varify is:

  1. is the model correctly implemented? Comparing the result with Stan/brms would be helpful
  2. compare the two:
  • model 1 with no correlation matrix of the random effect, for each MCMC sample, compute the correlation \rho_a between the random effect (that has a shape = (nsubject, 2))
  • model 2 with correlation matrix explicitly model for the random effect, extract the \rho_b
    \rho_a should distributed similar to \rho_b

To further clarify why I would compare the two: what you are plotting is the correlation of the random effect within each subject, which is not what the correlation among the random effect in the model capturing.


#16

I found a brms implementation already made in this blog post. The results indicate that the correlation is, in fact, \rho=0.06 [-0.51, 0.69]. The value I found with my Cholesky implementation is \rho=0.07 [-0.45, 0.59] which seems pretty close (I will run further test myself later).
I found also a rstanarm implementation: here \rho=0.07. In the same post he plotted the individual correlation for each participant


which looks like the ones I plotted too (with a correlation of \rho=-0.5). So I would conclude, as you said, that we cannot estimate the correlation between the random effect by looking at the correlation of their posterior samples

I obtained values that seem wrong. The following represents the distribution of \rho computed from each MCMC sample as

rho = []

for i in gamma_Z:
    rho.append(np.corrcoef(i.T)[0, 1])

where gamma_Z has shape (nsubject, 2). np.corrcoef assumes that each row of represents a variable, and each column a single observation of all those variables, hence the transpose T command.

I do not know how to interpret this result. Does it mean that modelling random effects without correlation matrix is wrong? As suggested in the post you linked Covariance of multiple varying coefficients in hierarchical models we are posing a strong prior on the random effects to be independent (but the data attempt to say otherwise).

image

Given the above, \rho_a does not agree well with \rho_b. But \rho_b agrees with the brms results.

I then check whether computing the \rho from the MCMC samples, agrees with \rho computed from the cholesky matrix

    covariance_matrix = pm.Deterministic('covariance_matrix', tt.dot(chol, chol.T))
    
    # Extract the standard deviations and rho
    standard_deviations = pm.Deterministic('standard_deviations', tt.sqrt(tt.diag(covariance_matrix)))
    correlation_matrix = pm.Deterministic('correlation_matrix', tt.diag(standard_deviations**-1).dot(covariance_matrix.dot(tt.diag(standard_deviations**-1))))
    rho = pm.Deterministic('rho', correlation_matrix[np.triu_indices(2, k=1)])

image

I would expect them to be almost identical, but they are not. Maybe, in general, it is not a good idea to compute the correlation from MCMC samples then.


#17

Thanks for the detail investigation! I guess that concludes that: the literatures were right, you do need a cov matrix for random effect.

A follow up analysis would be to compare LOO and WAIC between the two models. As visually the model without cov for random effect seems to fit fairly good as well.


#18

I have to thank YOU, and the awesome job you do at helping the community!

Yes, I think I should do that. And wrap up the analyses we did in a nice notebook. Thank you again.


#19

I ran a quick test. It seems the “simpler” model, without modelling the covariance matrix, is slightly better then the one that includes the covariance matrix. On both models I got some warnings when computing WAIC and LOO, so the results may not be super trustworthy. Anyway, it gives an idea I guess.

Model WAIC LOO
w\o covariance matrix 1718.72 1722.01
w\ covariance matrix 1719.59 1723.77

Further investigation is needed :stuck_out_tongue:


Covariance of multiple varying coefficients in hierarchical models
#20

Hi @junpenglao, I asked in the STAN forum if it is possible to recover the correlation coefficient from the posterior samples (see here). In theory our approach is correct, but we would need way more samples to get a proper estimation.