I’ve been playing around with a model to estimate the causal effect of a binary intervention on a binary outcome. There is also an unobserved confound causing both the “exposure” and the “outcome”. The idea is to use instrumental variable analysis. However, the sampling is a disaster and the model is not even close to retrieving the parameter of interest.
The code is based on this blog post
import pymc as pm
import numpy as np
import pandas as pd
import arviz as az
import pytensor.tensor as pt
from scipy.special import expit as invlogit
rng = np.random.default_rng()
sample_size = 1000
I = rng.binomial(p=0.8, n=1, size=sample_size)
U = rng.normal(loc=0, scale=1, size=sample_size)
X = rng.binomial(p=invlogit(-1 - (3 * (1 - I)) + 0.6 * U), n=1, size=sample_size)
Y = rng.binomial(p=invlogit(-1 - 0.5 * X + 0.9 * U), n=1, size=sample_size)
dat_sim = pd.DataFrame({
'X': X,
'I': I,
'Y': Y,
'U': U
})
with pm.Model() as iv_model_bin:
intercept_y = pm.Normal('intercept_y', mu=0, sigma=1)
intercept_t = pm.Normal('intercept_t', mu=0, sigma=1)
beta_t = pm.Normal('beta_t', mu=0, sigma=1)
beta_z = pm.Normal('beta_z', 0, 1)
sd_dist = pm.HalfCauchy.dist(beta=5, shape=2)
chol, _, _ = pm.LKJCholeskyCov('chol_cov', eta=2, n=2, sd_dist=sd_dist)
mu_y = pm.Deterministic('mu_y', beta_t * dat_sim.X + intercept_y)
mu_t = pm.Deterministic('mu_t', beta_z * dat_sim.I + intercept_t)
mu = pm.Deterministic('mu', pt.stack(tensors=([mu_y, mu_t]), axis=1))
mv = pm.MvNormal('mv', mu=mu, chol=chol, shape=(sample_size, 2))
likelihood = pm.Bernoulli('likelihood', logit_p=mv, observed=pt.stack(([dat_sim.Y, dat_sim.X]), axis=1), shape=(sample_size,2))
idata_iv_model_bin = pm.sample()
az.summary(idata_iv_model_bin, var_names=['intercept_y', 'intercept_t', 'beta_t', 'beta_z'])
My suspicions are:
- The model code/specification is causing the issues. I’m hoping that this is the issue. I’ve tried another version where I have two separate observed likelihoods, one for Y and one for I, but the results are roughly the same.
- Something about the data generating process is causing instrumental variable analysis to fail here. (I’ve not yet found anything in the literature suggesting that IV will cause biased estimates in this specific situation.)
- The priors are off. (I find it hard to wrap my mind around the priors here honestly.)
- There is something fundamentally wrong with me as a person. (Likely true.)
Why is my model a disaster? Any pointers, ideas or suggestions to help me figure this out will be greatly appreciated.