Hello PyMC community!
I am having some difficulty estimating the correlations between latent scores, or I guess equivalently random effects. Let me set the stage a bit to describe the problem:
I have two behavioural tasks that a group of participants respond to. One task is scored as correct/incorrect, so Bernoulli distributed, while the other asks for the participant's level of endorsement of a set of questions on an ordinal Likert scale. In my field, the typical approach is to sum the scores on each task, per participant, and then correlate the sums. This typically produces relatively weak associations, and some folks think this is because the sum-score approach ignores trial-level variation (see, for example, work by Jeffrey Rouder, which may be known to readers).
Enter the hierarchical model! My aim was to use Bayesian inference to estimate a set of latent scores - two per participant, one on each task - that are correlated, coming from a multivariate normal. These are then used as scores that plug in to each participants’ set of data in each task, which are evaluated under their appropriate likelihood (a Bernoulli for the one task, and an OrderedProbit with cutpoints for the other).
Using this architecture I've successfully estimated a model and the latent correlation, but it rings alarm bells because the correlation almost doubles (from about .40 to .85!), which has serious ramifications for practice in the area. Literally the definition of 'huge if true'.
I have two problems, one theoretical and one technical. The first is that this seems too good to be true. I've tested the approach on data where a correlation is theoretically expected but known to be quite weak, and it goes up a lot there too. Worse, it also bumps up the correlation between two tasks whose sum-score correlation is low for theoretically sensible reasons, from about .15 to around .6. I really need a sanity check as to whether the structure is correct. I've followed @tcapretto's excellent blog on simulating data with PyMC using pm.do and pm.observe, and the structure checks out, recovering any latent correlation I give it from the raw data.
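In case it's useful for anyone checking my logic, the recovery check I ran looked roughly like the sketch below. It's stripped down to a self-contained toy: a single scalar correlation parameter in place of the LKJ prior (so pm.do can target it directly), made-up participant/trial counts, and placeholder variable names, so please read it as the shape of the check rather than my exact code.

import arviz as az
import numpy as np
import pymc as pm

n_subs, n_trials = 50, 20
pid = np.repeat(np.arange(n_subs), n_trials)  # participant index for every trial

sim_coords = {'pid': np.arange(n_subs),
              'task': ['cfmt', 'cfmq'],
              'nobs': np.arange(n_subs * n_trials)}

with pm.Model(coords=sim_coords) as sim_model:
    # Latent correlation as a single free parameter, so the intervention below can target it
    rho = pm.Uniform('rho', -1, 1)
    cov = pm.math.stack([1.0, rho, rho, 1.0]).reshape((2, 2))
    latent = pm.MvNormal('latent', mu=[0, 0], cov=cov, dims=('pid', 'task'))
    cutpoints = pm.Normal('cutpoints', mu=np.linspace(-3, 3, 4), sigma=0.1,
                          transform=pm.distributions.transforms.ordered)
    pm.Bernoulli('sim_cfmt', p=pm.math.invprobit(latent[pid, 0]), dims='nobs')
    pm.OrderedProbit('sim_cfmq', eta=latent[pid, 1], cutpoints=cutpoints,
                     compute_p=False, dims='nobs')

# Fix the latent correlation at a known value and simulate one fake dataset
with pm.do(sim_model, {'rho': 0.4}) as fixed_model:
    fake = pm.sample_prior_predictive(draws=1, random_seed=1)

# Condition the original model on the simulated responses and refit
simulated = {name: fake.prior[name].squeeze().values for name in ['sim_cfmt', 'sim_cfmq']}
with pm.observe(sim_model, simulated) as recovery_model:
    recovered = pm.sample(random_seed=1)

az.summary(recovered, var_names='rho')  # posterior sits near the 0.4 that went in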
The second problem is that I am not sure which parameterisation to use: I get different results with pm.LKJCorr and pm.LKJCholeskyCov. Unfortunately I am at my limit with the mathematics there and with knowing when to pick one over the other. I know both distributions have had changes in recent releases that make them a bit more user friendly, but I worry this is the source of my issues.
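For what it's worth, my loose (and possibly wrong) understanding is that using the LKJCorr matrix directly as the covariance pins the latent standard deviations to 1, whereas LKJCholeskyCov with an sd_dist also estimates those standard deviations, so the implied covariance is diag(sigma) @ R @ diag(sigma) rather than R itself. A quick numpy illustration with made-up numbers:

import numpy as np

R = np.array([[1.0, 0.6],
              [0.6, 1.0]])    # a correlation matrix (numbers made up)
sigma = np.array([1.5, 0.8])  # latent SDs that sd_dist would otherwise estimate

cov = np.diag(sigma) @ R @ np.diag(sigma)  # covariance implied by the Cholesky parameterisation
print(np.allclose(cov, R))                 # True only when both SDs are exactly 1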
A fully reproducible example is below using a slice of data from the full set - any thoughts/comments are massively appreciated before I make a fool of myself at a conference in the near future. PyMCheers!
import arviz as az
import pandas as pd
import numpy as np
import pymc as pm
import scipy.stats as st
rng = np.random.default_rng(
    sum(map(ord, 'estimating latent correlations is tricky!'))
)
# Read in data
df = pd.read_csv('https://raw.githubusercontent.com/alexjonesphd/py4psy2024/refs/heads/main/test_data.csv', dtype={'Response': int})
# Estimating a simple sum-score correlation
simple_corrs = (df
                .groupby(['Participant_ID', 'TrialType'], as_index=False)
                .agg(score=('Response', 'sum'))
                .pivot(index='Participant_ID', columns='TrialType', values='score')
                .apply(st.zscore, ddof=1)
                )
simple_corrs.corr() # ~ .41 approx
# A hierarchical model with two likelihoods
# Data and coords etc
cfmq = df.query('TrialType == "cfmq"').reset_index(drop=True)
cfmt = df.query('TrialType == "cfmt"').reset_index(drop=True)
# Pull out indexes and labels - the labels will be identical between tasks, the indexers will differ
cfmq_pid_idx, cfmq_pid_label = cfmq['Participant_ID'].factorize()
cfmt_pid_idx, cfmt_pid_label = cfmt['Participant_ID'].factorize()
# Set coordinates
coords = {'cfmq_pid': cfmq_pid_label,
          'cfmt_pid': cfmt_pid_label,
          'cfmt_nobs': cfmt.index,
          'cfmq_nobs': cfmq.index,
          'task': ['cfmt', 'cfmq']}
# Model block
with pm.Model(coords=coords) as hierarchy:

    # Correlation - not sure which of these two approaches is correct
    corr = pm.LKJCorr('corr', n=2, eta=1, return_matrix=True)  # this one suggests a correlation of about .78
    latent_scores = pm.MvNormal('latent_scores',
                                mu=[0, 0],
                                cov=corr,
                                dims=('cfmq_pid', 'task'))

    # # Correlation - this approach suggests about .85
    # chol, corr, sigma = pm.LKJCholeskyCov('cmat',
    #                                       eta=1,
    #                                       n=2,
    #                                       sd_dist=pm.HalfNormal.dist(3),
    #                                       compute_corr=True,
    #                                       store_in_trace=True)
    # latent_scores = pm.MvNormal('latent_scores',
    #                             mu=[0, 0],
    #                             chol=chol,
    #                             dims=('cfmq_pid', 'task'))

    # Extract the scores for each task and participant
    cfmt_latent = latent_scores[cfmt_pid_idx, 0]
    cfmq_latent = latent_scores[cfmq_pid_idx, 1]

    # Cutpoints for the ordinal likelihood
    cutpoints = pm.Normal('cutpoints',
                          mu=np.linspace(-3, 3, 4),
                          sigma=0.1,
                          transform=pm.distributions.transforms.ordered)

    # Likelihoods
    pm.Bernoulli('cfmt_obs', p=pm.math.invprobit(cfmt_latent), observed=cfmt['Response'].values, dims='cfmt_nobs')
    pm.OrderedProbit('cfmq_obs', eta=cfmq_latent, cutpoints=cutpoints, observed=cfmq['Response'].values, compute_p=False, dims='cfmq_nobs')

    # Sample
    idata = pm.sample(random_seed=rng)

az.summary(idata, var_names='corr')
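When I run the commented-out LKJCholeskyCov block instead, I read the correlation (and latent SDs) off the stored deterministics, which, if I have the naming right, are:

az.summary(idata, var_names=['cmat_corr', 'cmat_stds'])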