Instrumental variable analysis with binary outcome

Interim update.

I think the issue lies in the GLM portion of the modeling. If X and Y are continuous (normally distributed) with a binary instrument, I can recover the DGP parameters easily.

import pymc as pm
import numpy as np
import pandas as pd
import arviz as az
import pytensor.tensor as pt

rng = np.random.default_rng()

sample_size = 1000

# Binary instrument I and unobserved confounder U
I = rng.binomial(p=0.8, n=1, size=sample_size)
U = rng.normal(loc=0, scale=1, size=sample_size)

# Continuous treatment and outcome; the true effect of X on Y is -0.5
X_cont = rng.normal(loc=-1 - (3 * (1 - I)) + 0.6 * U, scale=1, size=sample_size)
Y_cont = rng.normal(loc=-1 - 0.5 * X_cont + 0.9 * U, scale=1, size=sample_size)

dat_sim = pd.DataFrame({
    'X_cont': X_cont,
    'I': I,
    'Y_cont': Y_cont,
    'U': U
})

with pm.Model() as iv_model_cont:
    intercept_y = pm.Normal('intercept_y', 0, .5)
    intercept_t = pm.Normal('intercept_t', 0, .5)
    beta_t = pm.Normal('beta_t', 0, 1)
    beta_z = pm.Normal('beta_z', 0, 1)

    # Cholesky factor of the error covariance: the off-diagonal term
    # absorbs the confounding between the two equations
    sd_dist = pm.HalfCauchy.dist(beta=5)
    chol, _, _ = pm.LKJCholeskyCov('chol_cov', eta=2, n=2, sd_dist=sd_dist)

    # Outcome equation (Y on X) and first-stage equation (X on I)
    mu_y = pm.Deterministic('mu_y', beta_t * dat_sim.X_cont + intercept_y)
    mu_t = pm.Deterministic('mu_t', beta_z * dat_sim.I + intercept_t)
    mu = pm.Deterministic('mu', pt.stack(tensors=(mu_y, mu_t), axis=1))

    # Joint likelihood over (Y, X) with correlated errors
    mv = pm.MvNormal('mv', mu=mu, chol=chol,
                     observed=np.stack((dat_sim.Y_cont, dat_sim.X_cont), axis=1),
                     shape=(sample_size, 2))

    idata_iv_model_cont = pm.sample()

az.summary(idata_iv_model_cont, var_names=['intercept_y', 'intercept_t', 'beta_t', 'beta_z'])


              mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
intercept_y -0.965  0.070  -1.107   -0.845      0.002    0.001    1875.0    2205.0    1.0
intercept_t -3.961  0.084  -4.116   -3.804      0.002    0.001    1887.0    2449.0    1.0
beta_t      -0.488  0.035  -0.553   -0.424      0.001    0.001    1679.0    2159.0    1.0
beta_z       2.979  0.094   2.803    3.154      0.002    0.002    1744.0    2218.0    1.0
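
These line up with the values implied by the simulation: intercept_y = -1, beta_t = -0.5, and, since X_cont has mean -4 when I = 0 and -1 when I = 1, intercept_t = -4 and beta_z = 3. A quick sketch to compare posterior means against those true values:

# True parameter values implied by the data-generating process above
true_vals = {'intercept_y': -1.0, 'intercept_t': -4.0, 'beta_t': -0.5, 'beta_z': 3.0}

post = idata_iv_model_cont.posterior
for name, truth in true_vals.items():
    print(f'{name}: posterior mean {float(post[name].mean()):.3f} vs true {truth}')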

In the binary case, the ivtools R package recovers the DGP parameters with both its 2SLS and its G-estimation approach when the sample size is large (reliably around 20,000). I'm confused as to why neither the 2SLS approach from linearmodels nor my PyMC-based attempt at 2SLS works.
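
For reference, the linearmodels fit I'm comparing against looks something like the sketch below (it assumes dat_sim also holds the binary X and Y columns from the simulation in the original post):

from linearmodels.iv import IV2SLS

# 2SLS with I instrumenting the endogenous binary treatment X
res_2sls = IV2SLS.from_formula('Y ~ 1 + [X ~ I]', data=dat_sim).fit()
print(res_2sls.params)

And here is my PyMC-based attempt at 2SLS: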

# NB: dat_sim here refers to the binary simulation from the original post
# (binary columns X and Y, plus the instrument I)
with pm.Model() as iv_2sls:
    intercept_y = pm.Normal('intercept_y', 0, .5)
    intercept_t = pm.Normal('intercept_t', 0, .5)
    beta_t = pm.Normal('beta_t', 0, 1)
    beta_z = pm.Normal('beta_z', 0, 1)

    # First stage: binary treatment X regressed on the instrument I
    mu_t = pm.Deterministic('mu_t', beta_z * dat_sim['I'] + intercept_t)
    t = pm.Bernoulli('likelihood_t', logit_p=mu_t, observed=dat_sim['X'], shape=(sample_size,))

    # Second stage: binary outcome Y regressed on the treatment
    mu_y = pm.Deterministic('mu_y', beta_t * t + intercept_y)
    y = pm.Bernoulli('likelihood_y', logit_p=mu_y, observed=dat_sim['Y'], shape=(sample_size,))

    idata_iv_2sls = pm.sample()

az.summary(idata_iv_2sls, var_names=['intercept_y', 'intercept_t', 'beta_t', 'beta_z'])

              mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
intercept_y -0.955  0.025  -1.003   -0.907      0.001    0.000    2141.0    2389.0    1.0
intercept_t -3.553  0.131  -3.788   -3.298      0.003    0.002    1500.0    1954.0    1.0
beta_t       0.007  0.053  -0.093    0.104      0.001    0.001    2438.0    2201.0    1.0
beta_z       2.631  0.133   2.383    2.880      0.003    0.002    1564.0    1902.0    1.0
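
Here beta_t collapses to roughly 0, nowhere near the true effect. My hunch is that, because t is observed, the outcome equation is just a plain confounded logistic regression of Y on X, with nothing tying the two equations' errors together the way the MvNormal does in the continuous model. A quick way to check that hunch (again assuming the binary X and Y columns):

import statsmodels.api as sm

# If this matches the beta_t posterior above, the instrument never
# actually informed the outcome equation
naive = sm.Logit(dat_sim['Y'], sm.add_constant(dat_sim['X'])).fit()
print(naive.params)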

I guess I'll have to go dumpster diving in the code for the R package. In the meantime, if anyone knows what's going on, please let me know. I would love to use PyMC for this project rather than a frequentist R package.
