Difficulty estimating latent correlations

Hello PyMC community!

I am having some difficulty estimating the correlations between latent scores, or I guess equivalently random effects. Let me set the stage a bit to describe the problem:

I have two behavioural tasks that a group of participants respond to. One task is scored correct/incorrect, so Bernoulli distributed, while the other measures each participant’s level of endorsement of a set of questions on an ordinal Likert scale. In my field, the typical approach is to sum the scores on each task, per participant, and then correlate the sums. Typically this yields relatively weak associations, and some folks think this could be because the sum-score approach ignores trial-level variation (work by Jeffrey Rouder, for example, who may be known to readers here).

Enter the hierarchical model! My aim was to use Bayesian inference to estimate a set of latent scores - two per participant, one per task - that are correlated, coming from a multivariate normal. These then plug in to each participant’s data in each task, which are evaluated under the appropriate likelihood (a Bernoulli for one task, and an OrderedProbit with cutpoints for the other).

Using this architecture I’ve successfully estimated a model and the latent correlation, but it rings alarm bells because the correlation almost doubles (from about .40 to .85!), which would have serious ramifications for practice in the area. Literally the definition of ‘huge if true’.

I have two problems, one theoretical and one technical. The first is that this seems too good to be true. I’ve tested the approach on data where a correlation is theoretically expected but the sum-score estimate is quite weak, and the latent estimate goes up a lot. The trouble is that the approach also inflates the correlation between two tasks whose sum-score correlation is low for sound theoretical reasons, from about .15 to .6. I really need a sanity check on whether the structure is correct. I’ve followed @tcapretto’s excellent blog post on simulating data with PyMC using do and observe, and the structure checks out, recovering any latent correlation I feed it from the raw data.

The second problem is that I am not sure which parameterisation to use: I get different results with pm.LKJCorr and pm.LKJCholeskyCov. Unfortunately I am at the limit of my mathematics there and don’t know when to pick one over the other - I know recent releases have made both distributions a bit more user friendly, but I worry this is the source of my issues.

A fully reproducible example is below using a slice of data from the full set - any thoughts/comments are massively appreciated before I make a fool of myself at a conference in the near future. PyMCheers!

import arviz as az
import pandas as pd
import numpy as np
import pymc as pm
import scipy.stats as st

rng = np.random.default_rng(
    sum(map(ord, 'estimating latent correlations is tricky!'))
)

# Read in data
df = pd.read_csv('https://raw.githubusercontent.com/alexjonesphd/py4psy2024/refs/heads/main/test_data.csv', dtype={'Response': int})

# Estimating a simple sum-score correlation
simple_corrs = (df
                .groupby(['Participant_ID', 'TrialType'], as_index=False)
                .agg(score=('Response', 'sum'))
                .pivot(index='Participant_ID', columns='TrialType', values='score')
                .apply(st.zscore, ddof=1)
               )

simple_corrs.corr() # ~ .41 approx

# A hierarchical model with two likelihoods 
# Data and coords etc
cfmq = df.query('TrialType == "cfmq"').reset_index(drop=True)
cfmt = df.query('TrialType == "cfmt"').reset_index(drop=True)

# Pull out indexes and labels - the labels will be identical between tasks, the indexers will differ
cfmq_pid_idx, cfmq_pid_label = cfmq['Participant_ID'].factorize()
cfmt_pid_idx, cfmt_pid_label = cfmt['Participant_ID'].factorize()

# Set coordinates
coords = {'cfmq_pid': cfmq_pid_label, 
          'cfmt_pid': cfmt_pid_label,
          'cfmt_nobs': cfmt.index,
          'cfmq_nobs': cfmq.index,
          'task': ['cfmt', 'cfmq']}

# Model block
with pm.Model(coords=coords) as hierarchy:

    # Correlation - not sure which of these two approaches is correct 
    corr = pm.LKJCorr('corr', n=2, eta=1, return_matrix=True) # this one suggests a correlation of about .78
    latent_scores = pm.MvNormal('latent_scores', 
                                mu=[0, 0],
                                cov=corr,
                                dims=('cfmq_pid', 'task'))
    
    # # # Correlation - this approach suggests about .85
    # chol, corr, sigma = pm.LKJCholeskyCov('cmat',
    #                                       eta=1, 
    #                                       n=2,
    #                                       sd_dist=pm.HalfNormal.dist(3),
    #                                       compute_corr=True, 
    #                                       store_in_trace=True)
    # latent_scores = pm.MvNormal('latent_scores', 
    #                             mu=[0, 0],
    #                             chol=chol,
    #                             dims=('cfmq_pid', 'task'))

    # Extract the scores for each task and participant
    cfmt_latent = latent_scores[cfmt_pid_idx, 0]
    cfmq_latent = latent_scores[cfmq_pid_idx, 1]

    # Cutpoints for the ordinal likelihood
    cutpoints = pm.Normal('cutpoints',
                          mu=np.linspace(-3, 3, 4), 
                          sigma=0.1,
                          transform=pm.distributions.transforms.ordered)

    # Likelihoods
    pm.Bernoulli('cfmt_obs', p=pm.math.invprobit(cfmt_latent), observed=cfmt['Response'].values, dims='cfmt_nobs')
    pm.OrderedProbit('cfmq_obs', eta=cfmq_latent, cutpoints=cutpoints, observed=cfmq['Response'].values, compute_p=False, dims='cfmq_nobs')

    # Sample
    idata = pm.sample(random_seed=rng)

az.summary(idata, var_names='corr')

In the first model you’re passing a correlation matrix to an argument that expects a covariance matrix. You’re thus fixing the variances at 1 and putting an upper bound on the covariances. In other words, you’re not fitting the same model in both cases.
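To see the difference concretely, here is a quick NumPy illustration with made-up numbers: using a correlation matrix as a covariance matrix pins the standard deviations at 1, whereas scaling it by a diagonal matrix of standard deviations gives a proper covariance matrix.

import numpy as np

# Illustration only (made-up values): a correlation matrix used directly as a
# covariance matrix implies standard deviations of exactly 1
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])
print(np.sqrt(np.diag(R)))  # [1. 1.] -- the implied standard deviations

# Scaling by a diagonal matrix of standard deviations, Sigma = D @ R @ D,
# gives a real covariance matrix (this is what the model below constructs)
sigmas = np.array([0.75, 1.3])
D = np.diag(sigmas)
Sigma = D @ R @ D
print(np.sqrt(np.diag(Sigma)))  # [0.75 1.3 ]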

A bit more self-promotion: I have another blog post where I work with both LKJCorr and LKJCholeskyCov. In the LKJCorr one I show how to create the covariance matrix. It should be a bit easier now, since you get the correlation matrix directly and not just the upper triangular part. See: Hierarchical modeling with the LKJ prior in PyMC – Tomi Capretto

@alj here you have it implemented

import pytensor.tensor as pt

# Model block
with pm.Model(coords=coords) as hierarchy:
    corr = pm.LKJCorr('corr', n=2, eta=1, return_matrix=True)

    sigma_u = pm.HalfNormal("sigma_u", 3, shape=2)

    # Construct diagonal matrix of standard deviation
    sigma_diagonal = pm.Deterministic("sigma_diagonal", pt.eye(2) * sigma_u)

    # Compute covariance matrix
    Sigma = pt.nlinalg.matrix_dot(sigma_diagonal, corr, sigma_diagonal)

    latent_scores = pm.MvNormal(
        'latent_scores',
        mu=[0, 0],
        cov=Sigma,  
        dims=('cfmq_pid', 'task')
    )

    # Extract the scores for each task and participant
    cfmt_latent = latent_scores[cfmt_pid_idx, 0]
    cfmq_latent = latent_scores[cfmq_pid_idx, 1]

    # Cutpoints for the ordinal likelihood
    cutpoints = pm.Normal('cutpoints',
                          mu=np.linspace(-3, 3, 4),
                          sigma=0.1,
                          transform=pm.distributions.transforms.ordered)

    # Likelihoods
    pm.Bernoulli('cfmt_obs', p=pm.math.invprobit(cfmt_latent), observed=cfmt['Response'].values, dims='cfmt_nobs')
    pm.OrderedProbit('cfmq_obs', eta=cfmq_latent, cutpoints=cutpoints, observed=cfmq['Response'].values, compute_p=False, dims='cfmq_nobs')

    # Sample
    idata = pm.sample(random_seed=rng)

az.summary(idata, var_names='corr')

Keep in mind the note in pymc.LKJCorr — PyMC dev documentation:

This is mainly useful if you want the standard deviations to be fixed, as LKJCholeskyCov is optimized for the case where they come from a distribution.

In other words, it’s recommended to use LKJCholeskyCov and pass chol to MvNormal.
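That is essentially the commented-out block from your first model; the sketch below (hierarchy_chol is just a placeholder name, reusing the coords defined above) shows the recommended pattern:

with pm.Model(coords=coords) as hierarchy_chol:

    # Cholesky factor of the covariance; the standard deviations get their own prior
    chol, corr, sigmas = pm.LKJCholeskyCov('cmat',
                                           n=2,
                                           eta=1,
                                           sd_dist=pm.HalfNormal.dist(3),
                                           compute_corr=True,
                                           store_in_trace=True)

    latent_scores = pm.MvNormal('latent_scores',
                                mu=[0, 0],
                                chol=chol,
                                dims=('cfmq_pid', 'task'))

    # ... same indexing, cutpoints and likelihoods as in the model above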

Thank you as ever @tcapretto. I saw your other blog post, but the upper triangular parts were a bit confusing; I am familiar, though, with the approach you used above to generate the covariance.

So at least that solves the implementation! Now I am concerned about whether the results are sensible, as such a high correlation has some really big implications, if it’s accurate. A good approach here, I guess, is to use the model to simulate some data, as you suggest in your blog post - for example, by setting the correlation to a low value and seeing what the model makes of that data. This seems easier with the LKJCorr implementation (I am likely wrong!), so something a bit like this:

# Simulation model code, using LKJCorr
with pm.Model(coords=coords) as hierarchy2:

    # Set prior on sigma
    sigma_u = pm.HalfNormal("sigma_u", 3, dims='task')

    # Construct diagonal matrix of standard deviation
    sigma_diagonal = pm.Deterministic("sigma_diagonal", pt.eye(2) * sigma_u)

    # Prior on correlation
    corr = pm.LKJCorr('corr', n=2, eta=1, return_matrix=True) 

    # Compute covariance matrix via pytensor
    Σ = pt.nlinalg.matrix_dot(sigma_diagonal, corr, sigma_diagonal)
    
    latent_scores = pm.MvNormal('latent_scores', 
                                mu=[0, 0],
                                cov=Σ,
                                dims=('cfmq_pid', 'task'))
    
    # Extract the scores for each task and participant
    cfmt_latent = latent_scores[cfmt_pid_idx, 0]
    cfmq_latent = latent_scores[cfmq_pid_idx, 1]

    # Cutpoints for the ordinal likelihood
    cutpoints = pm.Normal('cutpoints',
                          mu=np.linspace(-3, 3, 4), 
                          sigma=0.1,
                          transform=pm.distributions.transforms.ordered)

    # Likelihoods - no observed data
    pm.Bernoulli('cfmt_obs', p=pm.math.invprobit(cfmt_latent), dims='cfmt_nobs')
    pm.OrderedProbit('cfmq_obs', eta=cfmq_latent, cutpoints=cutpoints, compute_p=False, dims='cfmq_nobs')

# Use do operator to set sigmas to values from original model fit,
# corr to zero,
# and cutpoints similar to those in the original model fit
zero_fit = pm.do(hierarchy2, 
                 {'corr': np.array([0]), 
                  'sigma_u': np.array([.75, 1.3]),
                  'cutpoints': np.array([-2, -0.6, 0.6, 1.8])
                 })

# Draw observed data on both tasks
cfmt_obs, cfmq_obs = pm.draw([zero_fit['cfmt_obs'], zero_fit['cfmq_obs']], random_seed=rng)

# Set them and do inference
md = pm.observe(hierarchy2, {'cfmt_obs': cfmt_obs, 'cfmq_obs': cfmq_obs})
with md:
    inference = pm.sample(random_seed=rng)

az.summary(inference, var_names='corr')

This yields a correlation of -0.094 [-0.26, 0.07], which seems perhaps a bit negatively biased, but I might have got something wrong. Either way, if there are any thoughts about the accuracy of the model’s original estimate or how to test it further with simulation, I would be super appreciative! Thanks so much again for the advice!

You could explore more scenarios and perform more simulations for each scenario (perhaps between 20 and 50? the more the better, but I’m also considering computational costs).

Other scenarios could be based on:

  • Different correlation levels
  • Different number of observations

Let’s suppose you do 50 iterations. Then you can do (terribly simplified)

# The helper names (simulate_data, build_model, fit_model_and_compute_summaries)
# are placeholders for whatever you implement; results are collected per scenario
results_dict = {c: {n: [] for n in SAMPLE_SIZES} for c in CORRELATIONS}

for n in SAMPLE_SIZES:
    for c in CORRELATIONS:
        for i in range(50):
            data = simulate_data(n, c)
            model = build_model(data)
            results = fit_model_and_compute_summaries(model)
            results_dict[c][n].append(results)

Once you have that, you could, for example, see credible intervals for the correlation for the different underlying correlation levels and sample sizes.
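For example, something like this - assuming each entry stored in results_dict[c][n] is the az.summary row for the correlation (that is an assumption about what fit_model_and_compute_summaries returns):

import pandas as pd

# Flatten the nested results into a table; assumes each `res` has the default
# az.summary columns ('mean', 'hdi_3%', 'hdi_97%') for the correlation
rows = []
for c in CORRELATIONS:
    for n in SAMPLE_SIZES:
        for res in results_dict[c][n]:
            rows.append({'true_corr': c, 'n': n,
                         'mean': res['mean'],
                         'hdi_low': res['hdi_3%'],
                         'hdi_high': res['hdi_97%']})

summary = (pd.DataFrame(rows)
           .assign(covered=lambda d: (d['hdi_low'] <= d['true_corr'])
                                     & (d['true_corr'] <= d['hdi_high']),
                   width=lambda d: d['hdi_high'] - d['hdi_low'])
           .groupby(['true_corr', 'n'])
           .agg(mean_estimate=('mean', 'mean'),
                coverage=('covered', 'mean'),
                mean_width=('width', 'mean')))

summary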

Back again @tcapretto - thank you for the advice. This seems sensible, akin to checking the error rates (good ol’ frequentism!) of the model.

I followed the approach above, having the model generate data for a fixed set of latent correlations - [0, .2, .4, .6, .8]. I did this 100 times per correlation and fit the model to that data to recover them. I didn’t vary sample size at all because, while this test case has about 100 participants, the actual dataset has over 4,000, and larger samples obviously slow down the sampling - this simulation took around 8 hours on my machine!
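Roughly, the loop was along these lines (a simplified sketch reusing hierarchy2, pm.do and pm.observe from above; the real run saved out fuller summaries per fit):

# Simplified sketch of the simulation loop (reuses hierarchy2 from above)
true_corrs = [0, .2, .4, .6, .8]
n_sims = 100
recovered = {r: [] for r in true_corrs}

for r in true_corrs:
    for _ in range(n_sims):
        # Fix the latent correlation (and the other parameters) at known values
        fixed = pm.do(hierarchy2,
                      {'corr': np.array([r]),
                       'sigma_u': np.array([.75, 1.3]),
                       'cutpoints': np.array([-2, -0.6, 0.6, 1.8])})

        # Generate a synthetic dataset from the intervened model
        cfmt_sim, cfmq_sim = pm.draw([fixed['cfmt_obs'], fixed['cfmq_obs']],
                                     random_seed=rng)

        # Refit the original model to the synthetic data and keep the summary
        with pm.observe(hierarchy2, {'cfmt_obs': cfmt_sim, 'cfmq_obs': cfmq_sim}):
            sim_idata = pm.sample(random_seed=rng, progressbar=False)

        recovered[r].append(az.summary(sim_idata, var_names='corr'))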

The results are in this figure. I guess this looks good and is relatively encouraging? I did a simple check of whether the true correlation was contained within the lower and upper credible limits, and coverage varies a little, but obviously this is only 100 simulations per correlation - {0: 93%, .2: 95%, .4: 90%, .6: 97%, .8: 97%}.

It seems to do pretty well, especially when the correlations are higher, which is what’s indicated when the model is estimated on the full dataset (which takes over 30 minutes, hence the smaller sample here). So I guess this is a sensible structure, and the results are trustworthy? Any further thoughts are welcomed. I also tried the model on truly random data (e.g. just 1s and 0s for the one task, and random integers within range for the other), and it told me the correlation was close to zero, but this time with massive intervals, which I guess is because that sort of data has no group-specific variability going on.

I’m reassured but still circumspect, any further thoughts welcome!