Hi everyone!
A very beginner question here about Bayesian inference and A/B tests. I was following this tutorial with a very simple A/B-test example, and I was wondering how sensitive the "hypothesis test of delta" is to the sample size. This is the code I was using:
# Set up the PyMC model. Again assume Uniform priors for p_A and p_B.
import numpy as np
import pymc as pm
from scipy import stats

# These two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04

# Increase the sample size of both groups (A/B) on each iteration.
for i in range(1, 10):
    N_A = 15 * i**3
    N_B = 7 * i**3

    # Generate some observations.
    observations_A = stats.bernoulli.rvs(true_p_A, size=N_A)
    observations_B = stats.bernoulli.rvs(true_p_B, size=N_B)

    with pm.Model() as model:
        p_A = pm.Uniform("p_A", 0, 1)
        p_B = pm.Uniform("p_B", 0, 1)

        # Define the deterministic delta function. This is our unknown of interest.
        delta = pm.Deterministic("delta", p_A - p_B)

        # Set of observations; in this case we have two observation datasets.
        obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
        obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)

        step = pm.Metropolis()  # is Metropolis the right algorithm?
        burned_trace = pm.sample(10000, step=step, cores=1, chains=1, progressbar=False)

    delta_samples = burned_trace.posterior["delta"].values[0]  # only taking the first chain
    print(f"{N_A = } {N_B = }")
    print("Probability site A is WORSE than site B: %.3f" % np.mean(delta_samples < 0))
    print("Probability site A is BETTER than site B: %.3f" % np.mean(delta_samples > 0))
What I would expect (similar to a frequentist approach) is that, as the sample size grows, the answer moves from roughly 50%/50% towards something decisive, say 90%/10%, pointing at the true ordering (p_A > p_B), and that this sharpening would be directly related to the size of N. But the results left me very puzzled:
N_A = 15 N_B = 7
Probability site A is WORSE than site B: 0.262
Probability site A is BETTER than site B: 0.738
N_A = 120 N_B = 56
Probability site A is WORSE than site B: 0.529
Probability site A is BETTER than site B: 0.471
N_A = 405 N_B = 189
Probability site A is WORSE than site B: 0.851
Probability site A is BETTER than site B: 0.149
N_A = 960 N_B = 448
Probability site A is WORSE than site B: 0.251
Probability site A is BETTER than site B: 0.749
N_A = 1875 N_B = 875
Probability site A is WORSE than site B: 0.916
Probability site A is BETTER than site B: 0.084
N_A = 3240 N_B = 1512
Probability site A is WORSE than site B: 0.041
Probability site A is BETTER than site B: 0.959
N_A = 5145 N_B = 2401
Probability site A is WORSE than site B: 0.019
Probability site A is BETTER than site B: 0.981
N_A = 7680 N_B = 3584
Probability site A is WORSE than site B: 0.000
Probability site A is BETTER than site B: 1.000
N_A = 10935 N_B = 5103
Probability site A is WORSE than site B: 0.000
Probability site A is BETTER than site B: 1.000
I'm particularly concerned that, in this run, when N is around 1000 the answer is wrong, and "confidently" wrong: nearly 90% posterior probability for the false hypothesis.
N_A = 1875 N_B = 875
Probability site A is WORSE than site B: 0.916
Probability site A is BETTER than site B: 0.084
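
To separate sampler noise from the actual posterior, I was thinking of cross-checking against the closed-form conjugate posterior (my assumption here: a Beta(1, 1) prior, which is the same thing as my Uniform(0, 1) priors), drawing from it directly with no MCMC at all. Something along these lines:

import numpy as np
from scipy import stats

def prob_A_better(obs_A, obs_B, n_draws=100_000, seed=0):
    # Beta(1, 1) prior + Bernoulli data gives a Beta(1 + successes, 1 + failures)
    # posterior in closed form, so we can sample it directly.
    rng = np.random.default_rng(seed)
    post_A = stats.beta(1 + obs_A.sum(), 1 + len(obs_A) - obs_A.sum())
    post_B = stats.beta(1 + obs_B.sum(), 1 + len(obs_B) - obs_B.sum())
    draws_A = post_A.rvs(n_draws, random_state=rng)
    draws_B = post_B.rvs(n_draws, random_state=rng)
    return np.mean(draws_A > draws_B)  # estimate of P(p_A > p_B | data)

# e.g. prob_A_better(observations_A, observations_B)

My reasoning is that if this closed-form check disagrees with the MCMC result on the same data, the problem is the sampling setup; if they agree, the data themselves just aren't informative enough at that N.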
I know there is a concept called a chain that might play a role here (posterior variability), and in this implementation I may have mistakenly reduced it to 1; is that an error? Does the Metropolis sampling algorithm play a role here too? Should the prior distribution (Uniform) be defined differently? Is there something else I'm misunderstanding or misinterpreting, and is this even the right way to run a hypothesis test? Am I confusing "confidence" in the Bayesian framework with the equivalent concept in a frequentist hypothesis test?
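
In case it matters, this is the variant I was planning to try next: dropping the explicit Metropolis step so PyMC falls back to its default sampler (NUTS), running several chains, and pooling all chains when computing the probabilities (the draw and chain counts below are arbitrary choices on my part):

with pm.Model():
    p_A = pm.Uniform("p_A", 0, 1)
    p_B = pm.Uniform("p_B", 0, 1)
    delta = pm.Deterministic("delta", p_A - p_B)
    pm.Bernoulli("obs_A", p_A, observed=observations_A)
    pm.Bernoulli("obs_B", p_B, observed=observations_B)
    # No explicit step method: PyMC picks NUTS for these continuous parameters.
    idata = pm.sample(2000, chains=4, progressbar=False)

delta_samples = idata.posterior["delta"].values.ravel()  # pool all chains
print("Probability site A is WORSE than site B: %.3f" % np.mean(delta_samples < 0))
print("Probability site A is BETTER than site B: %.3f" % np.mean(delta_samples > 0))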
Thanks in advance! Any help is appreciated!