Hi PyMC community,
I have a perhaps quite a naive question I’m looking for clarification on. I’ve been reading into posterior predictive simulation and all the great applications its capable of, across the PyMC labs blogs and some of the thoughtful responses on the discourse, and want to make sure I have the concepts clear.
Say I have two variables that represent an objective and a subjective measure of performance on a task. They’re correlated imperfectly at around .5; but reach the significance threshold (so the field takes the relationship as a “given”!), and so there’s a broad perception that if individuals score high on one task, they score high on another. The objective task is slow and costly to administer, and your result on the task puts you into groups based on some thresholds. To get around this, its been proposed that people have insight into their ability on the task, so we can just ask them what they think they’d get - cheaper, faster, and “good enough”.
I’m thinking of leaning on posterior predictive simulation to draw attention to the implications of such a correlation at the level of the individual and why we should be more careful in our conclusions, but I want to make sure I’ve understood it correctly. Lets say that if you have a subjective score of -2, which would be a cutoff of interest on the objective task. Can I use the posterior predictive distribution to make a probabilistic claim about how likely that score is on the objective task, given a model and some data? Some code that matches by use case is below which hopefully makes my rambling clear:
import arviz as az
import numpy as np
import pymc as pm
rng = np.random.default_rng(35)
# Draw data
n = 200
subjective = pm.draw(pm.Normal.dist(), n, random_seed=rng)
objective = pm.draw(pm.Normal.dist(mu=subjective*0.52, sigma=0.83), random_seed=rng)
# Model
with pm.Model() as test:
# Data
X = pm.MutableData('X', subjective)
Y = pm.MutableData('Y', objective)
# Parameters
β0 = pm.Normal('β0', mu=0, sigma=1)
β1 = pm.Normal('β1', mu=0, sigma=1)
σ = pm.HalfNormal('σ', sigma=1)
pm.Normal('y',
mu=β0 + β1*X,
sigma=σ,
observed=Y)
idata = pm.sample(draws=500, tune=1000)
# Set new data, draw posterior predictive
with test:
pm.set_data({"X": [-2], "Y": [0]})
pm.sample_posterior_predictive(idata, random_seed=rng, predictions=True, extend_inferencedata=True)
# Plot the prediction, highlight -2 probability
az.plot_posterior(idata, group='predictions', var_names='y', ref_val=-2)
I guess my question is really - what is the correct interpretation of this prediction? Is it given the model and the posterior, a likely future, unobserved objective score for a person with -2 on the subjective is on average -0.93, ranges from -2.5 to 0.64; and the probability of it being -2 or less is about 10%?
I’ve seen some examples where whole datasets of varying sizes are generated from the posterior predictive sample, but this doesn’t seem applicable here, only about a single new datapoint and its possible future values. Am I correct in this or way off?
Tagging @tcapretto, @ricardoV94, @AlexAndorra, and @jessegrabowski who have written some great stuff on posterior predictive sampling in the past.
Thanks so much for reading!