Understanding params in partially pooled model

Hi,

I am working on a randomized controlled trial model to estimate effects for trial/control and for distribution (a trial/control email pair sent on the same date). I would expect the distribution beta estimates to be 0 rather than ~0.1, since the test data controls for the distribution effect. It seems that some of the explanatory power of the trial/control betas is leaking into the distribution betas. I suspect this is caused by partial pooling between the trial/control and distribution betas (e.g. because each distribution contains both a trial and a control group, some of the trial effect leaks into the distribution estimate).

Is there a recommended approach to modeling that would avoid this leakage?

See below for the testing data, model code, and trace summary. Thank you for any feedback!

Data Example:
   trial_control  distribution   sent  opens
0              0             0    100     20
1              1             0    500    150
2              0             1   1000    200
3              1             1   5000   1500
4              0             2  10000   2000
5              1             2  50000  15000
# Model code
import numpy as np
import pandas as pd
import pymc3 as pm

data = pd.DataFrame({'trial_control': [0, 1, 0, 1, 0, 1],
                     'distribution': [0, 0, 1, 1, 2, 2],
                     'sent': [100, 500, 1000, 5000, 10000, 50000],
                     'opens': [20, 150, 200, 1500, 2000, 15000]})

# Set number of distributions and trial/control and distribution indexes
tc_idx = data['trial_control'].values
n_tc = data['trial_control'].nunique()
dist_idx = data['distribution'].values
n_dist = data['distribution'].nunique()

# Priors
base_open_rate_mu = -np.log(4)  # logistic(-np.log(4)) = 20%
base_open_rate_sigma = 0.125
trial_control_mu = 0
trial_control_sigma = 0.5
distribution_mu = 0
distribution_sigma = 0.5


def logistic(x):
    return 1 / (1 + np.exp(-1 * x))


with pm.Model() as open_model:
    # Intercept for logistic fn (baseline open rate: 0.2 = logistic(-ln(4))).
    b0 = pm.Normal('b0', base_open_rate_mu, base_open_rate_sigma)

    tc_beta = pm.Normal('trial_control_beta', trial_control_mu, trial_control_sigma, shape=n_tc)
    dist_beta = pm.Normal('distribution_beta', distribution_mu, distribution_sigma, shape=n_dist)

    # Logistic (outputs P(open))
    theta = pm.Deterministic('theta', logistic(b0 + tc_beta[tc_idx] + dist_beta[dist_idx]))
    
    # Data likelihood
    likelihood_opens = pm.Binomial('obs', p=theta, n=data['sent'].values, observed=data['opens'].values)


# Inference to approximate the posterior distribution
with open_model:
    trace = pm.sample(draws=2000, tune=1000, init='auto')


print(pm.summary(trace))
pm.traceplot(trace) 
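As an aside, a quick check of the intercept-prior arithmetic in the comment above (same logistic function, same base_open_rate_mu):

```python
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

base_open_rate_mu = -np.log(4)
# logistic(-ln 4) = 1 / (1 + 4) = 0.2
print(round(logistic(base_open_rate_mu), 6))  # 0.2
```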

# Results
                           mean        sd  mc_error   hpd_2.5  hpd_97.5  \
b0                    -1.369596  0.120983  0.002495 -1.604596 -1.133555   
trial_control_beta__0 -0.122449  0.237752  0.005715 -0.599465  0.325807   
trial_control_beta__1  0.416473  0.236954  0.005724 -0.060768  0.861267   
distribution_beta__0   0.102000  0.234017  0.005448 -0.342699  0.575411   
distribution_beta__1   0.105062  0.226738  0.005434 -0.321203  0.581093   
distribution_beta__2   0.105482  0.225409  0.005391 -0.346519  0.549681   
theta__0               0.199793  0.014457  0.000144  0.172692  0.228046   
theta__1               0.299519  0.018342  0.000197  0.265566  0.336318   
theta__2               0.199953  0.005813  0.000063  0.189028  0.211798   
theta__3               0.299875  0.006056  0.000067  0.288166  0.311875   
theta__4               0.199985  0.003849  0.000042  0.192620  0.207582   
theta__5               0.299932  0.002052  0.000020  0.295843  0.303866   

                             n_eff      Rhat  
b0                     2735.656425  0.999767  
trial_control_beta__0  1840.009249  0.999842  
trial_control_beta__1  1832.556874  0.999840  
distribution_beta__0   1961.204705  0.999970  
distribution_beta__1   1834.922021  0.999880  
distribution_beta__2   1829.021279  0.999896  
theta__0               7264.240538  0.999848  
theta__1               7027.712167  0.999860  
theta__2               7794.364056  0.999833  
theta__3               7843.006894  0.999815  
theta__4               8801.569805  0.999757  
theta__5               8131.675057  0.999869 
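Plugging the posterior means from the summary into the linear predictor shows where the ~0.1 goes: it is offset by b0, so the fitted rates still come out near 20%/30% (a plug-in of means, so only approximately equal to the theta means):

```python
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

# Posterior means from the summary above
b0 = -1.369596
tc_beta_0 = -0.122449
dist_beta_0 = 0.102000

# Control group, first distribution: the +0.1 in dist_beta is absorbed by b0
theta_0 = logistic(b0 + tc_beta_0 + dist_beta_0)
print(round(theta_0, 3))  # 0.199, in line with theta__0 above
```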

I don’t think this is related to trial_control at all, but to the normal prior on the intercept. The problem goes away if I replace that with b0 = pm.Flat('b0').
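The linear predictor b0 + tc_beta[i] + dist_beta[j] is only identified up to additive shifts between its terms, so the tight Normal(-ln(4), 0.125) prior on b0 is what decides where that shared constant ends up. A minimal numpy sketch of the invariance (illustrative values only, not from the trace):

```python
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))

b0 = -np.log(4)
tc_beta = np.array([0.0, 0.5])
dist_beta = np.array([0.0, 0.0, 0.0])

# Shift a constant c from the intercept into every distribution beta:
# the likelihood p = logistic(b0 + tc + dist) is unchanged, so the data
# cannot tell the two parameterizations apart -- only the priors can.
c = 0.1
theta_a = logistic(b0 + tc_beta[:, None] + dist_beta[None, :])
theta_b = logistic((b0 + c) + tc_beta[:, None] + (dist_beta - c)[None, :])
print(np.allclose(theta_a, theta_b))  # True
```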


Thank you! That worked perfectly.