Difficulty estimating latent correlations

Hello PyMC community!

I am having some difficulty estimating the correlations between latent scores, or I guess equivalently random effects. Let me set the stage a bit to describe the problem:

I have two behavioural tasks that a group of participants respond to. One task is scored correct/incorrect, so Bernoulli distributed, while the other measures each participant’s level of endorsement of a set of questions on an ordinal Likert scale. In my field, the typical approach is to sum the scores on each task, per participant, and then correlate the sums. Typically this yields relatively weak associations, and some folks think this could be because the sum-score approach ignores trial-level variation (work by Jeffrey Rouder, for example, who may be known to readers here).

Enter the hierarchical model! My aim was to use Bayesian inference to estimate a set of latent scores - two per participant, one per task - that are correlated, coming from a multivariate normal. These then plug in to each participant’s data in each task, which are evaluated under the appropriate likelihood (a Bernoulli for one task, and an OrderedProbit with cutpoints for the other).

Using this architecture I’ve successfully estimated a model and the latent correlation, but it rings alarm bells because the correlation almost doubles (from about .40 to .85!), which would have serious ramifications for practice in the area. Literally the definition of ‘huge if true’.

I have two problems, one theoretical and one technical. The first is that this seems too good to be true. I’ve tested the approach on data where a correlation is theoretically expected but the sum-score estimate is quite weak, and the latent estimate goes up a lot. The trouble is that the approach also inflates the correlation between two tasks whose sum-score correlation is low for sound theoretical reasons, from about .15 to .6. I really need a sanity check on whether the structure is correct. I’ve followed @tcapretto’s excellent blog post on simulating data with PyMC using do and observe, and the structure checks out, recovering any latent correlation I feed it from the raw data.

The second problem is that I am not sure which parameterisation to use: I get different results with pm.LKJCorr and pm.LKJCholeskyCov. Unfortunately I am at the limit of my mathematics there and don’t know when to pick one over the other - I know recent releases have made both distributions a bit more user friendly, but I worry this is the source of my issues.

A fully reproducible example is below using a slice of data from the full set - any thoughts/comments are massively appreciated before I make a fool of myself at a conference in the near future. PyMCheers!

import arviz as az
import pandas as pd
import numpy as np
import pymc as pm
import scipy.stats as st

rng = np.random.default_rng(
    sum(map(ord, 'estimating latent correlations is tricky!'))
)

# Read in data
df = pd.read_csv('https://raw.githubusercontent.com/alexjonesphd/py4psy2024/refs/heads/main/test_data.csv', dtype={'Response': int})

# Estimating a simple sum-score correlation
simple_corrs = (df
                .groupby(['Participant_ID', 'TrialType'], as_index=False)
                .agg(score=('Response', 'sum'))
                .pivot(index='Participant_ID', columns='TrialType', values='score')
                .apply(st.zscore, ddof=1)
               )

simple_corrs.corr() # ~ .41 approx

# A hierarchical model with two likelihoods 
# Data and coords etc
cfmq = df.query('TrialType == "cfmq"').reset_index(drop=True)
cfmt = df.query('TrialType == "cfmt"').reset_index(drop=True)

# Pull out indexes and labels - the labels will be identical between tasks, the indexers will differ
cfmq_pid_idx, cfmq_pid_label = cfmq['Participant_ID'].factorize()
cfmt_pid_idx, cfmt_pid_label = cfmt['Participant_ID'].factorize()

# Set coordinates
coords = {'cfmq_pid': cfmq_pid_label, 
          'cfmt_pid': cfmt_pid_label,
          'cfmt_nobs': cfmt.index,
          'cfmq_nobs': cfmq.index,
          'task': ['cfmt', 'cfmq']}

# Model block
with pm.Model(coords=coords) as hierarchy:

    # Correlation - not sure which of these two approaches is correct 
    corr = pm.LKJCorr('corr', n=2, eta=1, return_matrix=True) # this one suggests a correlation of about .78
    latent_scores = pm.MvNormal('latent_scores', 
                                mu=[0, 0],
                                cov=corr,
                                dims=('cfmq_pid', 'task'))
    
    # # # Correlation - this approach suggests about .85
    # chol, corr, sigma = pm.LKJCholeskyCov('cmat',
    #                                       eta=1, 
    #                                       n=2,
    #                                       sd_dist=pm.HalfNormal.dist(3),
    #                                       compute_corr=True, 
    #                                       store_in_trace=True)
    # latent_scores = pm.MvNormal('latent_scores', 
    #                             mu=[0, 0],
    #                             chol=chol,
    #                             dims=('cfmq_pid', 'task'))

    # Extract the scores for each task and participant
    cfmt_latent = latent_scores[cfmt_pid_idx, 0]
    cfmq_latent = latent_scores[cfmq_pid_idx, 1]

    # Cutpoints for the ordinal likelihood
    cutpoints = pm.Normal('cutpoints',
                          mu=np.linspace(-3, 3, 4), 
                          sigma=0.1,
                          transform=pm.distributions.transforms.ordered)

    # Likelihoods
    pm.Bernoulli('cfmt_obs', p=pm.math.invprobit(cfmt_latent), observed=cfmt['Response'].values, dims='cfmt_nobs')
    pm.OrderedProbit('cfmq_obs', eta=cfmq_latent, cutpoints=cutpoints, observed=cfmq['Response'].values, compute_p=False, dims='cfmq_nobs')

    # Sample
    idata = pm.sample(random_seed=rng)

az.summary(idata, var_names='corr')

In the first model you’re passing a correlation matrix to an argument that expects a covariance matrix. You’re thus fixing the variances at 1 and putting an upper bound on the covariances. In other words, you’re not fitting the same model in both cases.
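To see the difference concretely, here is a quick NumPy illustration with made-up numbers: using a correlation matrix as a covariance matrix pins the standard deviations at 1, whereas scaling it by a diagonal matrix of standard deviations gives a proper covariance matrix.

import numpy as np

# Illustration only (made-up values): a correlation matrix used directly as a
# covariance matrix implies standard deviations of exactly 1
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])
print(np.sqrt(np.diag(R)))  # [1. 1.] -- the implied standard deviations

# Scaling by a diagonal matrix of standard deviations, Sigma = D @ R @ D,
# gives a real covariance matrix (this is what the model below constructs)
sigmas = np.array([0.75, 1.3])
D = np.diag(sigmas)
Sigma = D @ R @ D
print(np.sqrt(np.diag(Sigma)))  # [0.75 1.3 ]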

A bit more self-promotion: I have another blog post where I work with both LKJCorr and LKJCholeskyCov. In the LKJCorr one I show how to create the covariance matrix. It should be a bit easier now, since you get the correlation matrix directly and not just the upper triangular part. See: Hierarchical modeling with the LKJ prior in PyMC – Tomi Capretto

@alj here you have it implemented

import pytensor.tensor as pt

# Model block
with pm.Model(coords=coords) as hierarchy:
    corr = pm.LKJCorr('corr', n=2, eta=1, return_matrix=True)

    sigma_u = pm.HalfNormal("sigma_u", 3, shape=2)

    # Construct diagonal matrix of standard deviation
    sigma_diagonal = pm.Deterministic("sigma_diagonal", pt.eye(2) * sigma_u)

    # Compute covariance matrix
    Sigma = pt.nlinalg.matrix_dot(sigma_diagonal, corr, sigma_diagonal)

    latent_scores = pm.MvNormal(
        'latent_scores',
        mu=[0, 0],
        cov=Sigma,  
        dims=('cfmq_pid', 'task')
    )

    # Extract the scores for each task and participant
    cfmt_latent = latent_scores[cfmt_pid_idx, 0]
    cfmq_latent = latent_scores[cfmq_pid_idx, 1]

    # Cutpoints for the ordinal likelihood
    cutpoints = pm.Normal('cutpoints',
                          mu=np.linspace(-3, 3, 4),
                          sigma=0.1,
                          transform=pm.distributions.transforms.ordered)

    # Likelihoods
    pm.Bernoulli('cfmt_obs', p=pm.math.invprobit(cfmt_latent), observed=cfmt['Response'].values, dims='cfmt_nobs')
    pm.OrderedProbit('cfmq_obs', eta=cfmq_latent, cutpoints=cutpoints, observed=cfmq['Response'].values, compute_p=False, dims='cfmq_nobs')

    # Sample
    idata = pm.sample(random_seed=rng)

az.summary(idata, var_names='corr')

Keep in mind the note in pymc.LKJCorr — PyMC dev documentation:

This is mainly useful if you want the standard deviations to be fixed, as LKJCholeskyCov is optimized for the case where they come from a distribution.

In other words, it’s recommended to use LKJCholeskyCov and pass chol to MvNormal.
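That is essentially the commented-out block from your first model; the sketch below (hierarchy_chol is just a placeholder name, reusing the coords defined above) shows the recommended pattern:

with pm.Model(coords=coords) as hierarchy_chol:

    # Cholesky factor of the covariance; the standard deviations get their own prior
    chol, corr, sigmas = pm.LKJCholeskyCov('cmat',
                                           n=2,
                                           eta=1,
                                           sd_dist=pm.HalfNormal.dist(3),
                                           compute_corr=True,
                                           store_in_trace=True)

    latent_scores = pm.MvNormal('latent_scores',
                                mu=[0, 0],
                                chol=chol,
                                dims=('cfmq_pid', 'task'))

    # ... same indexing, cutpoints and likelihoods as in the model above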

Thank you as ever @tcapretto. I saw your other blog post, but the upper triangular parts were a bit confusing; I am familiar, though, with the approach you used above to generate the covariance.

So at least that solves the implementation! Now I am concerned about whether the results are sensible, as such a high correlation has some really big implications, if it’s accurate. A good approach here, I guess, is to use the model to simulate some data, as you suggest in your blog post - for example, by setting the correlation to a low value and seeing what the model makes of that data. This seems easier with the LKJCorr implementation (I am likely wrong!), so something a bit like this:

# Simulation model code, using LKJCorr
with pm.Model(coords=coords) as hierarchy2:

    # Set prior on sigma
    sigma_u = pm.HalfNormal("sigma_u", 3, dims='task')

    # Construct diagonal matrix of standard deviation
    sigma_diagonal = pm.Deterministic("sigma_diagonal", pt.eye(2) * sigma_u)

    # Prior on correlation
    corr = pm.LKJCorr('corr', n=2, eta=1, return_matrix=True) 

    # Compute covariance matrix via pytensor
    Σ = pt.nlinalg.matrix_dot(sigma_diagonal, corr, sigma_diagonal)
    
    latent_scores = pm.MvNormal('latent_scores', 
                                mu=[0, 0],
                                cov=Σ,
                                dims=('cfmq_pid', 'task'))
    
    # Extract the scores for each task and participant
    cfmt_latent = latent_scores[cfmt_pid_idx, 0]
    cfmq_latent = latent_scores[cfmq_pid_idx, 1]

    # Cutpoints for the ordinal likelihood
    cutpoints = pm.Normal('cutpoints',
                          mu=np.linspace(-3, 3, 4), 
                          sigma=0.1,
                          transform=pm.distributions.transforms.ordered)

    # Likelihoods - no observed data
    pm.Bernoulli('cfmt_obs', p=pm.math.invprobit(cfmt_latent), dims='cfmt_nobs')
    pm.OrderedProbit('cfmq_obs', eta=cfmq_latent, cutpoints=cutpoints, compute_p=False, dims='cfmq_nobs')

# Use do operator to set sigmas to values from original model fit,
# corr to zero,
# and cutpoints similar to those in the original model fit
zero_fit = pm.do(hierarchy2, 
                 {'corr': np.array([0]), 
                  'sigma_u': np.array([.75, 1.3]),
                  'cutpoints': np.array([-2, -0.6, 0.6, 1.8])
                 })

# Draw observed data on both tasks
cfmt_obs, cfmq_obs = pm.draw([zero_fit['cfmt_obs'], zero_fit['cfmq_obs']], random_seed=rng)

# Set them and do inference
md = pm.observe(hierarchy2, {'cfmt_obs': cfmt_obs, 'cfmq_obs': cfmq_obs})
with md:
    inference = pm.sample(random_seed=rng)

az.summary(inference, var_names='corr')

This yields a correlation of -0.094 [-0.26, 0.07], which seems perhaps a bit negatively biased, but I might have got something wrong. Either way, if there are any thoughts about the accuracy of the model’s original estimate or how to test it further with simulation, I would be super appreciative! Thanks so much again for the advice!

You could explore more scenarios and perform more simulations for each scenario (perhaps between 20 and 50? the more the better, but I’m also considering computational costs).

Other scenarios could be based on:

  • Different correlation levels
  • Different number of observations

Let’s suppose you do 50 iterations. Then you can do (terribly simplified)

# The helper names (simulate_data, build_model, fit_model_and_compute_summaries)
# are placeholders for whatever you implement; results are collected per scenario
results_dict = {c: {n: [] for n in SAMPLE_SIZES} for c in CORRELATIONS}

for n in SAMPLE_SIZES:
    for c in CORRELATIONS:
        for i in range(50):
            data = simulate_data(n, c)
            model = build_model(data)
            results = fit_model_and_compute_summaries(model)
            results_dict[c][n].append(results)

Once you have that, you could, for example, see credible intervals for the correlation for the different underlying correlation levels and sample sizes.
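For example, something like this - assuming each entry stored in results_dict[c][n] is the az.summary row for the correlation (that is an assumption about what fit_model_and_compute_summaries returns):

import pandas as pd

# Flatten the nested results into a table; assumes each `res` has the default
# az.summary columns ('mean', 'hdi_3%', 'hdi_97%') for the correlation
rows = []
for c in CORRELATIONS:
    for n in SAMPLE_SIZES:
        for res in results_dict[c][n]:
            rows.append({'true_corr': c, 'n': n,
                         'mean': res['mean'],
                         'hdi_low': res['hdi_3%'],
                         'hdi_high': res['hdi_97%']})

summary = (pd.DataFrame(rows)
           .assign(covered=lambda d: (d['hdi_low'] <= d['true_corr'])
                                     & (d['true_corr'] <= d['hdi_high']),
                   width=lambda d: d['hdi_high'] - d['hdi_low'])
           .groupby(['true_corr', 'n'])
           .agg(mean_estimate=('mean', 'mean'),
                coverage=('covered', 'mean'),
                mean_width=('width', 'mean')))

summary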

Back again @tcapretto - thank you for the advice. This seems sensible, akin to checking the error rates (good ol’ frequentism!) of the model.

I followed the approach above, having the model generate data for a fixed set of latent correlations - [0, .2, .4, .6, .8]. I did this 100 times per correlation and fit the model to that data to recover them. I didn’t vary sample size at all because, while this test case has about 100 participants, the actual dataset has over 4,000, and larger samples obviously slow down the sampling - this simulation took around 8 hours on my machine!
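Roughly, the loop was along these lines (a simplified sketch reusing hierarchy2, pm.do and pm.observe from above; the real run saved out fuller summaries per fit):

# Simplified sketch of the simulation loop (reuses hierarchy2 from above)
true_corrs = [0, .2, .4, .6, .8]
n_sims = 100
recovered = {r: [] for r in true_corrs}

for r in true_corrs:
    for _ in range(n_sims):
        # Fix the latent correlation (and the other parameters) at known values
        fixed = pm.do(hierarchy2,
                      {'corr': np.array([r]),
                       'sigma_u': np.array([.75, 1.3]),
                       'cutpoints': np.array([-2, -0.6, 0.6, 1.8])})

        # Generate a synthetic dataset from the intervened model
        cfmt_sim, cfmq_sim = pm.draw([fixed['cfmt_obs'], fixed['cfmq_obs']],
                                     random_seed=rng)

        # Refit the original model to the synthetic data and keep the summary
        with pm.observe(hierarchy2, {'cfmt_obs': cfmt_sim, 'cfmq_obs': cfmq_sim}):
            sim_idata = pm.sample(random_seed=rng, progressbar=False)

        recovered[r].append(az.summary(sim_idata, var_names='corr'))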

The results are in this figure. I guess this looks good and is relatively encouraging? I did a simple check of whether the true correlation was contained within the lower and upper credible limits, and coverage varies a little, but obviously this is only 100 simulations per correlation - {0: 93%, .2: 95%, .4: 90%, .6: 97%, .8: 97%}.

It seems to do pretty well, especially when the correlations are higher, which is what’s indicated when the model is estimated on the full dataset (which takes over 30 minutes, hence the smaller sample here). So I guess this is a sensible structure, and the results are trustworthy? Any further thoughts are welcomed. I also tried the model on truly random data (e.g. just 1s and 0s for the one task, and random integers within range for the other), and it told me the correlation was close to zero, but this time with massive intervals, which I guess is because that sort of data has no group-specific variability going on.

I’m reassured but still circumspect, any further thoughts welcome!