I’m working on a binomial model that in it’s simplest form has two parameters, one based on treatment received and the other based on gender.
I’m most interested in the p
parameter and wanted to get feedback on whether I’m interpreting things correctly. The ultimate goal is to be able to work with a sample set that isn’t necessarily balanced between genders in order to make inferences about the value of p
for the two treatments for the entire population.
Here’s an example:
import arviz as az
import numpy as np
import pymc3 as pm
n = 1_000
population_gender_share = [.5, .5]
sample_gender_share = [.75, .25]
true_male_p = 0.025
true_female_p = 0.075
true_treatment_effect = 0.5
treatment = np.random.choice([0, 1], n)
female = np.random.choice([0, 1], n, p=sample_gender_share)
p_clicked = ((true_female_p * female) + (true_male_p * (female - 1) * -1)) * (
1 + treatment * true_treatment_effect
)
outcome = np.random.binomial(1, p=p_clicked)
with pm.Model() as m:
b_treatment = pm.Normal("b_treatment", 0, 1.5, shape=2)
treatment_id = pm.intX(pm.Data("treatment_id", treatment))
b_gender = pm.Normal("b_gender", 0, 0.5, shape=2)
gender_id = pm.intX(pm.Data("gender_id", female))
p = pm.Deterministic(
"p", pm.math.invlogit(b_gender[gender_id] + b_treatment[treatment_id])
)
clicked = pm.Binomial("clicked", 1, p, observed=outcome)
trace = pm.sample(random_seed=123, tune=2000)
prior = pm.sample_prior_predictive(samples=1000)
idata_model = az.from_pymc3(
prior=prior,
trace=trace,
)
pm.set_data({"treatment_id": [0, 0, 1, 1], "gender_id": [0, 1, 0, 1]})
post = pm.sample_posterior_predictive(
idata_model.posterior, random_seed=123, var_names=["p"]
)
az.plot_forest(
[
{
"control": np.average(
post["p"][:, :2],
weights=population_gender_share,
axis=1,
),
"treatment": np.average(
post["p"][:, 2:],
weights=population_gender_share,
axis=1,
),
},
],
combined=True,
);
Is the way I’m working with p
in the posterior sample valid for inference?