Interim update.
I think the issue lies in the GLM portion of the model. When X and Y are continuous (normally distributed) and the instrument is binary, I can recover the DGP parameters easily:
```python
import pymc as pm
import numpy as np
import pandas as pd
import arviz as az
import pytensor.tensor as pt

rng = np.random.default_rng()
sample_size = 1000

# Simulate: binary instrument I, unobserved confounder U,
# continuous treatment X_cont and continuous outcome Y_cont.
I = rng.binomial(p=0.8, n=1, size=sample_size)
U = rng.normal(loc=0, scale=1, size=sample_size)
X_cont = rng.normal(loc=-1 - (3 * (1 - I)) + 0.6 * U, scale=1, size=sample_size)
Y_cont = rng.normal(loc=-1 - 0.5 * X_cont + 0.9 * U, scale=1, size=sample_size)

dat_sim = pd.DataFrame({
    'X_cont': X_cont,
    'I': I,
    'Y_cont': Y_cont,
    'U': U,
})
```
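As a quick sanity check on this simulation (my addition, plain NumPy, not part of the model): the DGP implies `X_cont = -4 + 3 * I + noise`, so an OLS fit of `X_cont` on `I` should recover an intercept near -4 and a slope near 3, matching the `intercept_t` and `beta_z` posteriors below.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_size = 1000

# Re-simulate the first stage only, with a fixed seed for reproducibility.
I = rng.binomial(p=0.8, n=1, size=sample_size)
U = rng.normal(loc=0, scale=1, size=sample_size)
X_cont = rng.normal(loc=-1 - (3 * (1 - I)) + 0.6 * U, scale=1, size=sample_size)

# OLS of X_cont on I via least squares: X_cont ≈ intercept + slope * I.
A = np.column_stack([np.ones(sample_size), I])
intercept, slope = np.linalg.lstsq(A, X_cont, rcond=None)[0]
print(intercept, slope)  # intercept near -4, slope near 3
```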
```python
with pm.Model() as iv_model_cont:
    intercept_y = pm.Normal('intercept_y', 0, .5)
    intercept_t = pm.Normal('intercept_t', 0, .5)
    beta_t = pm.Normal('beta_t', 0, 1)
    beta_z = pm.Normal('beta_z', 0, 1)

    sd_dist = pm.HalfCauchy.dist(beta=5)
    chol, _, _ = pm.LKJCholeskyCov('chol_cov', eta=2, n=2, sd_dist=sd_dist)

    mu_y = pm.Deterministic('mu_y', beta_t * dat_sim.X_cont + intercept_y)
    mu_t = pm.Deterministic('mu_t', beta_z * dat_sim.I + intercept_t)
    mu = pm.Deterministic('mu', pt.stack(tensors=(mu_y, mu_t), axis=1))

    mv = pm.MvNormal(
        'mv',
        mu=mu,
        chol=chol,
        observed=np.stack((dat_sim.Y_cont, dat_sim.X_cont), axis=1),
        shape=(sample_size, 2),
    )

    idata_iv_model_cont = pm.sample()
```
```python
az.summary(idata_iv_model_cont, var_names=['intercept_y', 'intercept_t', 'beta_t', 'beta_z'])
```

```
              mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
intercept_y -0.965  0.070  -1.107   -0.845      0.002    0.001    1875.0    2205.0    1.0
intercept_t -3.961  0.084  -4.116   -3.804      0.002    0.001    1887.0    2449.0    1.0
beta_t      -0.488  0.035  -0.553   -0.424      0.001    0.001    1679.0    2159.0    1.0
beta_z       2.979  0.094   2.803    3.154      0.002    0.002    1744.0    2218.0    1.0
```
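For comparison, here is classical 2SLS done by hand on the same continuous DGP (my addition, a sketch in plain NumPy rather than the `linearmodels` API): stage one regresses the endogenous X on the instrument, stage two regresses Y on the stage-one fitted values, and the stage-two slope should land near the true -0.5.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
I = rng.binomial(p=0.8, n=1, size=n)
U = rng.normal(loc=0, scale=1, size=n)
X = rng.normal(loc=-1 - (3 * (1 - I)) + 0.6 * U, scale=1, size=n)
Y = rng.normal(loc=-1 - 0.5 * X + 0.9 * U, scale=1, size=n)

# Stage 1: regress the endogenous X on the instrument I, keep fitted values.
A1 = np.column_stack([np.ones(n), I])
X_hat = A1 @ np.linalg.lstsq(A1, X, rcond=None)[0]

# Stage 2: regress Y on the fitted X_hat; the slope is the 2SLS estimate.
A2 = np.column_stack([np.ones(n), X_hat])
intercept_2sls, beta_2sls = np.linalg.lstsq(A2, Y, rcond=None)[0]
print(intercept_2sls, beta_2sls)  # near -1 and -0.5
```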
In the binary case, the ivtools R package is able to recover the DGP parameters, both with 2SLS and with the g-estimation approach, when the sample size is large (reliably at around 20,000). I'm a bit confused as to why the 2SLS approach from linearmodels and my PyMC-based attempt at 2SLS do not work:
```python
# Note: dat_sim here is the *binary* simulation (columns 'X' and 'Y'),
# not the continuous one shown above.
with pm.Model() as iv_2sls:
    intercept_y = pm.Normal('intercept_y', 0, .5)
    intercept_t = pm.Normal('intercept_t', 0, .5)
    beta_t = pm.Normal('beta_t', 0, 1)
    beta_z = pm.Normal('beta_z', 0, 1)

    mu_t = pm.Deterministic('mu_t', beta_z * dat_sim['I'] + intercept_t)
    t = pm.Bernoulli('likelihood_t', logit_p=mu_t, observed=dat_sim['X'], shape=(sample_size,))

    mu_y = pm.Deterministic('mu_y', beta_t * t + intercept_y)
    y = pm.Bernoulli('likelihood_y', logit_p=mu_y, observed=dat_sim['Y'], shape=(sample_size,))

    idata_iv_2sls = pm.sample()
```
```python
az.summary(idata_iv_2sls, var_names=['intercept_y', 'intercept_t', 'beta_t', 'beta_z'])
```

```
              mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
intercept_y -0.955  0.025  -1.003   -0.907      0.001    0.000    2141.0    2389.0    1.0
intercept_t -3.553  0.131  -3.788   -3.298      0.003    0.002    1500.0    1954.0    1.0
beta_t       0.007  0.053  -0.093    0.104      0.001    0.001    2438.0    2201.0    1.0
beta_z       2.631  0.133   2.383    2.880      0.003    0.002    1564.0    1902.0    1.0
```
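As a frequentist cross-check for the binary case, the Wald ratio for a binary instrument can be computed directly from arm means. A sketch under a hypothetical binary DGP (my addition; the actual binary simulation isn't shown above, so the coefficients here are assumed). Note that the Wald ratio targets the effect on the risk-difference scale, not the logistic coefficient itself:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000  # large sample, as noted above for ivtools in the binary case

def expit(z):
    # Logistic function.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical binary DGP with an unobserved confounder U.
I = rng.binomial(1, 0.5, size=n)
U = rng.normal(0, 1, size=n)
X = rng.binomial(1, expit(-1 + 2.0 * I + 0.6 * U))
Y = rng.binomial(1, expit(-1 + 1.0 * X + 0.9 * U))

# Wald ratio: difference in mean outcome across instrument arms,
# divided by the difference in mean treatment.
wald = (Y[I == 1].mean() - Y[I == 0].mean()) / (X[I == 1].mean() - X[I == 0].mean())
print(wald)  # positive, well below 1
```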
I guess I'll have to go dumpster diving in the R package's source. In the meantime, if anyone knows what's going on, please let me know. I would love to use PyMC for this project rather than a frequentist R package.