Correlated slopes in multivariate model


Dear all,

I am trying to perform an analysis on a multivariate data set that I will describe below.
My goal is to account for the covariance structure within some of the parameters (“predictors” which influence the observables, e.g. age, temperature, etc.; maybe sometimes referred to as “covariates”) which I acquire along with my measurements of interest (“observables”). The predictor estimates shall then serve as the slopes in a subsequent linear model to infer the observables. I am not fully certain whether that makes sense and if so, how to construct the model.

Your advice to give me a start for this is much appreciated!



starting point

I figured that there is different strategies of decomposing the covariance matrix. In the following examples, I found two different groups of model structures in pymc3, using either LKJCorr or LKJCholeskyCov.
(2) Notebook 12 from video session 6 of the recent presentation by @junpenglao
(3) an older blog post by Austin Rochford
(4) this and this stackoverflow question are related, especially the latter one, but it does not use the values for the predictors to estimate their correlation (see below).

In all of these examples, my trouble starts when attempting to use the observed values of the predictors for modelling of the observables.

A hierarchical GLM with independent predictors would be no problem for me. However, I assume that covariance of the predictors is relevant (but please correct me), based on the notion that the sampling procedure will depend on the coupling of variables (see, for example, here this blog entry by @twiecki ).

data structure

The data structure is as follows:

  • there are multiple observables (multivariate data), but for simplicity let’s start trying to model only one outcome variable
  • there are n_predictor (e.g. 3) predictors that are correlated and potentially influence the observables
  • there are groups of measurements

For simplicity, let’s ignore hierarchy and reduce to a single observable.
Here is toy data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pymc3 as pm
import theano.tensor as tt

pd.set_option('precision', 2)

### generate data
observations_per_group = 100

# we assessed values of three predictors
# which are correlated
predictor_names = ['pred1','pred2','pred3']
mu_actual = np.array([1, -2, 0.5])
n_predictors = len(mu_actual)
covmatrix_actual = np.array([ \
                       [ 1.0, -0.3,  0.0] \
                     , [-0.3,  1.0,  0.6] \
                     , [ 0.0,  0.6,  1.0] \

# data was measured from certain groups, giving it hierarchical structure.
n_groups = 2**4
intercepts = np.random.normal(1.,0.4,n_groups)
# print (intercepts)

# we obtained two measures of interest.
n_measures = 2

data = []
for grp_idx in range(n_groups):
    x = np.random.multivariate_normal(mu_actual, covmatrix_actual, size = observations_per_group)
    x = pd.DataFrame( x \
                    , columns = predictor_names \
                    , index = grp_idx*observations_per_group + np.arange(observations_per_group) \
    x['group'] = 'g%02.0f' % (grp_idx+1)
    for meas in range(n_measures):
        x['measure%i' % (meas+1)] = np.random.normal(intercepts[grp_idx], 0.1, observations_per_group) \
                                + meas * np.random.normal(0, 0.1, 1) # slight difference of measures

    data.append(x.loc[:,  ['group'] \
                        + predictor_names \
                        + ['measure%i' % (meas+1) for meas in range(n_measures)] \

data = pd.concat(data, axis = 0)

data['group'] = pd.Categorical(data['group'], ordered = False)
groups = list(data['group'].cat.categories)
data['group_index'] = data['group']

print (data)
n_observations = data.shape[0] 

example model

Here is my best attempt on the model, following (1):

### model definition
with pm.Model() as model:

    ### following
    packed_chol = pm.LKJCholeskyCov('packed_chol', n=n_predictors,
                                 eta=1., sd_dist=pm.HalfCauchy.dist(2.5))
    chol = pm.expand_packed_triangular(n_predictors, packed_chol)

    mu = pm.Normal('mu', 0., 10., shape = n_predictors \
                    , testval=data.loc[:, predictor_names].mean(axis=0) \
    predictors = pm.MvNormal('predictors', mu, chol=chol \
                            , observed=data.loc[:, predictor_names].values \

    ### glm model assembly
    intercept = pm.Normal('intercept', mu = 0, sd = 10)

    residual = pm.HalfCauchy('residual', 5)
    estimator = intercept
    for pred_idx, pred_name in enumerate(predictor_names):
        estimator = estimator \
                  + mu[pred_idx] \
                    * data[pred_name].values

    ### likelihood
    likelihood = pm.Normal(   'likelihood' \
                            , mu = estimator \
                            , sd = residual \
                            , observed = data['measure1'].values \

### sample and inspect
with model:
    trace = pm.sample(1000, tune = 1000, cores = 4, random_seed = 42)


The trace plot shows mu's of approximately [-0.18, -0.06, 0.15] (actual was: [1, -2, 0.5]), whereas when commenting out the sections of “glm model assembly” and “likelihood”, the inference is correct.
The problem is that observed data is entered twice (in predictors = ... for the model to estimate their correlation, and in likelihood = ... to fit the observables), and both influence the mean of the predictor values. Conceptually, I guess that the posterior distribution of the predictor should instead only be influenced by the measurements of the predictors themselves (independent of the observables).

I assume that the distinction between predictors and observables meaningful, because I need an intercept, and I could think of having multiple, independent blocks within the predictors that separately enter the GLM.


There were slightly other issues in other implementations, but all failed.
I am unsure whether to use LKJCholeskyCov (seems to be the more recent implementation) or LKJCorr (apparently more general) to generate priors for the covariance matrix, but both seem to be interchangeable to a certain degree (see alternative model below). Also, it could be that it is unnecessary to use MvNormal, as it was sometimes done in example (2).

So my rather general questions:

  • Does it make sense to model the predictors covariance as I intend to?
  • Then, which parametrization and which prior specification are useful here?
  • (How) Is it possible to decouple inference for predictors and observables?

Later, I will attempt to extend to hierarchical structure and multivariate observables (which I’ve done before, so that should be easier), but first things first.

Finally, if someone is interested, here is an alternative model definition, using LKJCorr, following (4):

with pm.Model() as model:

    mu = pm.Normal('mu', mu=0., sd=10, shape=n_predictors)
    sigma = pm.HalfCauchy('sigma', 5, shape=n_predictors)

    C_triu = pm.LKJCorr('C_triu', n=n_predictors, eta = 1.)
    C = tt.fill_diagonal(C_triu[np.zeros((n_predictors,n_predictors), 'int')], 1.)
    sigma_diag = tt.nlinalg.diag(sigma) 
    cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)
    # tau = tt.nlinalg.matrix_inverse(cov)

    predictors = pm.MvNormal('predictors', mu=mu, cov=cov, observed = data.loc[:, predictor_names].values )

    ### glm model assembly
    intercept_population = pm.Normal('intercept_population', mu = 0, sd = 10)
    residual = pm.HalfCauchy('residual', 5)
    estimator = intercept_population
    for pred_idx, pred_name in enumerate(predictor_names):
        estimator = estimator \
                  + mu[pred_idx] \
                    * data[pred_name].values

    ### likelihood
    likelihood = pm.Normal(   'likelihood' \
                            , mu = estimator \
                            , sd = residual \
                            , observed = data['measure1'].values \


In my notebook, what I did is to have the MvNormal model the latent variable, and have the predictor as the observation of said latent variable. The reason to do so (instead of assigning the observation directly to the MvNormal) is that it allows me to do center vs noncenter parameterization. I also tried your way of modeling it, but it does not work very well (havent figure it out why yet).

I think you might find the discussion in Cholesky decomposition and correlation among random effects between me and @Jack_Caster useful.


Thank you for the response!

Indeed, that discussion (and the traces to here and here) were immensely useful. The latter link summarizes best why covariance structure is necessary and why the posterior still shows the covariance. A lot to wra my head around.

I will follw the useful sleep study notebook by @Jack_Caster and your notebook. If I manage to adapt something for my example, I will post it here.




Dear all,

this story seems to be more complex. I again used “simulated” data and tried to recover the input parameters. Below are some lessons learned. Thank you all, @junpenglao, @Jack_Caster and @DanWeitzenfeld, for sharing your previous discussions and for the reading material.

My current conclusion is that, in a linear model, slope correlation cannot be distinguished from the slope effect itself. Thus, it is not necessary to use multivariate structures for blocks of predictors.

Any comments, disagreeing or affirmative, are appreciated!



(1) relevance/correctness of simulation

In one of the papers or blog posts I read in the course of this (unfortunately I cannot recover which; it was about work flow for building models) it was argued that simulation is an important first step to check if the model layout makes sense, i.e. that it is able to recover the effects it should. I agree, having tried to use my actual data at some points, but because I do not know the actual outcome, being unable to tell whether models were reliable.
A trivial, but important corollary is that you need to set up the simulation right. In my code above, no slopes/effects were added to the measurement. So that was a stupid mistake.
I found this error by visualizing the simulated data (as suggested here, was referred to in @junpenglao video presentation).

(2) matrix notation or not

No criticism at all, but I had trouble following the Cholesky example by @Jack_Caster, although it is perfectly all right. The reason is that I am untrained with the (ANOVA?) terminology, for example I get confused easily by the matrix (greek) letter abbreviations and what is “random” and “fixed”.
Anyways, I found that there is a key difference to my use case: whereas Jack models an interaction of an intercept and a slope (which both can or not be subject to hierarchical variation), I am trying to assess the covariance of different slopes (initially without hierarchy).
I decided to stick with a non-matrix terminology, except of course for the covariance matrix and its decompositions. Jack’s notebooks were generally very useful, I recommend it as an example for others.

(3) LKJCholeskyCov and LKJCorr

These two pymc3 methods differ in the way the covariance matrix is decomposed. One is a Cholesky decomposition (reference), the other is separating the covariance into standard deviations and correlations (reference). Both are initialized with what is called an “LKJ prior”, but I did not dig in yet to find out how that works.

I decided to continue with LKJCholeskyCov because (i) developers suggested it to be computationally more efficient in some forum thread that I missed to bookmark and (ii) thanks to Jack’s notebook, I was able to extract all the diagnostics I needed (mean values, cross-correlation, population level correlation; see attached code).

(4) covariance of parameters and covariance of slopes

In my original post, I tried to feed the observed values for the predictors to the sampler as “observed”. Also, I put in a likelihood for the measurement itself. This was giving a model with two different likelihoods to be estimated.
However, I figure that the way predictor observations entered did not make sense, because there is a difference between
(a) the observed correlation of the predictors themselves and
(b) the cross-correlation of their slopes.
In other words, although the measured values of two predictors can be fully independent, one can theoretically influence the effect of the other for a third observable.

Component (a) implicitly enters the model through the values that are multiplied with the slope parameters. Component (b) is not observed, but needs to be inferred. Previously, my attempt gave the values for (a) but told the model to infer (b), which certainly added to the disfunctionality of the original construct.

(5) simulation result

When incorporating all this knowledge and some change in terminology, I rebuildt the thing and ran the model. Please find the code here (12.6 KB).
It is possible to correctly infer the correlation of the predictor values (4a).
Then, for testing, I set this correlation to zero, and introduced a correlation between the slopes (4b) as follows:

  • The slope of the first predictor is zero.
  • The slope of the second predictor is strongly negative (-0.8).
  • Both are 40% correlated.

However, the model never manages to infer the cross correlation. Instead, cov1 is thought to have a negative slope value that was not there originally.
Intuitively, this makes sense to me, as the matter is ambiguous in a linear model of this composition: whether the change in the observable comes from an effect of the predictor itself, or from a correlated influence of another predictor, cannot be distinguished because all are combined in a linear way (simple multiplication).

Maybe one could tease the model to find cross correlation by manipulation priors (changing “eta” in the LKJ did not work). Certainly the story is different in non-linear models. And I am not sure whether this has implications to the intercept/slope correlation story.
However, because of my working conclusion that the model cannot distinguish, I decided to continue without modeling slope correlation.

Any thoughts on this are welcome!


That’s some great write up, thanks for sharing! You should make it into a blog post!


Thank you for the write up. I bookmarked this thread. I am sort of swamped with work right now, so I cannot really read it thoroughly.

I had (and still have) hard time too understanding how to model the correlation in hierarchical models. I am happy you found my notebooks useful, despite they were not meant to be very pedagogic :slight_smile: . I was meant to polish and publish them somewhere. Never had enough time to do that unfortunately.

I like the matrix notation, as it is very flexible (thanks to the awesome patsy package), but I understand if it does not click for you. There are many different ways to write down hierarchical models, and the matrix notation is one of the most compact and flexible, IMHO. If you look around you will see that packages like brms use this method as well. You may find these tutorial useful; they write down the same model using different parametrizations (including the matrix one).

If you learn new things on the topic, please keep us posted!


Dear Jack,

yes, thanks, I had bookmarked the tutorials you linked for later reading, I think they were in your notebook or other posts. I acknowledge that matrices and patsy are useful, maybe even ANOVA :wink: I just never had the time or the course opportunities or the pressing data at hand to dive in.

I have some hierarchy in my current data, and it is complex, so I will now tackle that and post updates.




Just a quick side note, ANOVA also use the same matrix notation, the difference is the way the matrix is form or the columns are balance. For example, you can have the restriction that each row sums to 1 for categorical predictors (also referred to as sigma-restricted parameterization, which correspondent to a type III ANOVA).
Needless to say, all these different terminology is confusing and difficult to communicate across different field. I think looking directly at the design matrix (or predictor matrix, or covariate matrix) is the best way to know what is the thing you are fitting: the meaning of the coefficients, how to interpret them etc


No time! I expect approximatly 25 years of hard work to come prior to independence of my children. If internet still exists then, I promise to start a blog!

Still, for now, I’ve summarized this in a notebook tutorial.

Two minor questions emerged:

  • how can I incorporate a model residual (\epsilon)?
  • @Jack_Caster, what was the purpose of the “Scaler” in your notebooks?

Thanks once more,



I realized scaler was not a very good name (I am submitting some changes).
Anyway, let’s assume that you want to parametrize your model using the non-centered version, for example to improve sampling and eliminate divergences (see here). Then, you can use this trick: \mathcal{N}(0, \sigma_z) = \mathcal{N}(0, 1) \cdot \sigma_z. In this equation, \sigma_z is the scaler (because it scales samples drawn from the standard normal distribution \mathcal{N}(0, 1)). In a linear mixed effects model, \sigma_z (that can be a deterministic or random variable) may represent the standard deviation of the random intercepts (or slopes). That is, the intercepts for the individual subjects are drawn from the distribution \mathcal{N}(0, \sigma_z).
In case you need it, this also holds true: \mathcal{N}(\mu_z, \sigma_z) = \mu_z + \mathcal{N}(0, 1) \cdot \sigma_z.

BTW, nice notebook, fun to read :slight_smile: