Hi,
I am working on a randomized controlled trial model to estimate the effects for trial/control and for distribution (a trial and control email sent on the same date). I would expect the distribution beta estimates to be 0 rather than ~0.1, since the testing data controls for the distribution effect. It seems that some of the explanatory power of the trial/control betas is leaking into the distribution betas. I know this is caused by partial pooling between the trial/control and distribution betas (e.g. each distribution contains both a trial and a control send, so some of the trial effect leaks into the distribution estimate).
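To make the leakage concrete, here is a minimal plain-numpy sketch (separate from the model code below; the beta values are made up) showing that a constant can be shifted between the trial/control betas and the distribution betas without changing any predicted open rate, so the data alone cannot decide how to split the effect:

import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

tc_idx = np.array([0, 1, 0, 1, 0, 1])    # same design as the data below
dist_idx = np.array([0, 0, 1, 1, 2, 2])
b0 = -np.log(4)
tc_beta = np.array([-0.1, 0.4])          # made-up values
dist_beta = np.array([0.1, 0.1, 0.1])

c = 0.1  # shift 0.1 from the distribution betas onto trial/control
theta_a = logistic(b0 + tc_beta[tc_idx] + dist_beta[dist_idx])
theta_b = logistic(b0 + (tc_beta + c)[tc_idx] + (dist_beta - c)[dist_idx])
print(np.allclose(theta_a, theta_b))     # True: both splits fit the data equally well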
Is there a recommended approach to modeling that would avoid this leakage?
See below for the testing data, model code, and trace summary. Thank you for any feedback!
Data Example:
   trial_control  distribution   sent  opens
0              0             0    100     20
1              1             0    500    150
2              0             1   1000    200
3              1             1   5000   1500
4              0             2  10000   2000
5              1             2  50000  15000
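Note that the observed open rates are 20% for control and 30% for trial in every distribution, so the distributions add nothing on top of the trial/control split. A quick sanity check on the frame above (recomputed here so it runs standalone):

import pandas as pd

data = pd.DataFrame({'trial_control': [0, 1, 0, 1, 0, 1],
                     'distribution': [0, 0, 1, 1, 2, 2],
                     'sent': [100, 500, 1000, 5000, 10000, 50000],
                     'opens': [20, 150, 200, 1500, 2000, 15000]})
data['open_rate'] = data['opens'] / data['sent']
# Open rate by distribution (rows) and trial/control (columns): 0.2 vs 0.3 everywhere
print(data.pivot(index='distribution', columns='trial_control', values='open_rate'))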
# Model code
import numpy as np
import pandas as pd
import pymc3 as pm

data = pd.DataFrame({'trial_control': [0, 1, 0, 1, 0, 1],
                     'distribution': [0, 0, 1, 1, 2, 2],
                     'sent': [100, 500, 1000, 5000, 10000, 50000],
                     'opens': [20, 150, 200, 1500, 2000, 15000]})

# Trial/control and distribution indexes and group counts
tc_idx = data['trial_control'].values
n_tc = data['trial_control'].nunique()
dist_idx = data['distribution'].values
n_dist = data['distribution'].nunique()

# Priors
base_open_rate_mu = -np.log(4)  # logistic(-np.log(4)) = 1 / (1 + 4) = 0.2
base_open_rate_sigma = 0.125
trial_control_mu = 0
trial_control_sigma = 0.5
distribution_mu = 0
distribution_sigma = 0.5

def logistic(x):
    # pm.math.exp so this works on PyMC3/Theano random variables
    return 1 / (1 + pm.math.exp(-x))

with pm.Model() as open_model:
    # Intercept for the logistic fn (baseline open rate: 0.2 = logistic(-ln(4)))
    b0 = pm.Normal('b0', base_open_rate_mu, base_open_rate_sigma)
    tc_beta = pm.Normal('trial_control_beta', trial_control_mu, trial_control_sigma, shape=n_tc)
    dist_beta = pm.Normal('distribution_beta', distribution_mu, distribution_sigma, shape=n_dist)
    # Logistic link (outputs P(open))
    theta = pm.Deterministic('theta', logistic(b0 + tc_beta[tc_idx] + dist_beta[dist_idx]))
    # Data likelihood
    likelihood_opens = pm.Binomial('obs', p=theta, n=data['sent'].values, observed=data['opens'].values)

# Inference to approximate the posterior distribution
with open_model:
    trace = pm.sample(draws=2000, tune=1000, init='auto')
print(pm.summary(trace))
pm.traceplot(trace)
# Results
                            mean        sd  mc_error   hpd_2.5  hpd_97.5        n_eff      Rhat
b0                     -1.369596  0.120983  0.002495 -1.604596 -1.133555  2735.656425  0.999767
trial_control_beta__0  -0.122449  0.237752  0.005715 -0.599465  0.325807  1840.009249  0.999842
trial_control_beta__1   0.416473  0.236954  0.005724 -0.060768  0.861267  1832.556874  0.999840
distribution_beta__0    0.102000  0.234017  0.005448 -0.342699  0.575411  1961.204705  0.999970
distribution_beta__1    0.105062  0.226738  0.005434 -0.321203  0.581093  1834.922021  0.999880
distribution_beta__2    0.105482  0.225409  0.005391 -0.346519  0.549681  1829.021279  0.999896
theta__0                0.199793  0.014457  0.000144  0.172692  0.228046  7264.240538  0.999848
theta__1                0.299519  0.018342  0.000197  0.265566  0.336318  7027.712167  0.999860
theta__2                0.199953  0.005813  0.000063  0.189028  0.211798  7794.364056  0.999833
theta__3                0.299875  0.006056  0.000067  0.288166  0.311875  7843.006894  0.999815
theta__4                0.199985  0.003849  0.000042  0.192620  0.207582  8801.569805  0.999757
theta__5                0.299932  0.002052  0.000020  0.295843  0.303866  8131.675057  0.999869
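For what it's worth, the quantities the data actually pin down look fine: the theta estimates are tight, and the trial-vs-control contrast on the logit scale should be much tighter than either beta alone, while the common ~0.1 shift sits on the distribution betas. A quick check on the draws (assuming PyMC3's MultiTrace indexing):

tc = trace['trial_control_beta']        # shape (n_samples, 2)
dist = trace['distribution_beta']       # shape (n_samples, 3)

contrast = tc[:, 1] - tc[:, 0]          # trial effect on the logit scale
print(contrast.mean(), contrast.std())  # should be far tighter than either beta alone

print(dist.mean(axis=0))                # the shared ~0.1 shift absorbed by the distribution betas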