Noob question: Bayesian inference in AB test vs frequentist approach (N size)

Hi everyone!
A very noob question here about Bayesian inference and A/B tests. I was following this tutorial with a very simplistic A/B-test example, and I was wondering how sensitive the “hypothesis test of delta” is to the sample size. This is the code I was using.

import numpy as np
import pymc as pm  # PyMC v4+ (pm.sample returns an InferenceData trace by default)
from scipy import stats

# Set up the PyMC model. Again, assume Uniform priors for p_A and p_B;
# these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04

# Increase the sample size of both groups (A/B) on each iteration.
for i in range(1, 10):
    N_A = 15 * i**3
    N_B = 7 * i**3

    # Generate some observations.
    observations_A = stats.bernoulli.rvs(true_p_A, size=N_A)
    observations_B = stats.bernoulli.rvs(true_p_B, size=N_B)

    with pm.Model() as model:
        p_A = pm.Uniform("p_A", 0, 1)
        p_B = pm.Uniform("p_B", 0, 1)
        # Define the deterministic delta function. This is our unknown of interest.
        delta = pm.Deterministic("delta", p_A - p_B)

        # Set of observations; in this case we have two observation datasets.
        obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
        obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)

        step = pm.Metropolis()  # is Metropolis the right algorithm?
        burned_trace = pm.sample(10000, step=step, cores=1, chains=1, progressbar=False)

    delta_samples = burned_trace.posterior["delta"].values[0]  # only taking the first chain
    print(f"{N_A = } {N_B = }")
    print("Probability site A is WORSE than site B: %.3f" % np.mean(delta_samples < 0))
    print("Probability site A is BETTER than site B: %.3f" % np.mean(delta_samples > 0))

What I would expect (similar to a frequentist approach) is that, as the sample size increases, the result moves from roughly 50%/50% toward something more definite, say 90%/10%, pointing at the right value (p_A > p_B). I would also expect this sharpening to be directly related to the size of N. But the results left me very puzzled:

N_A = 15 N_B = 7
Probability site A is WORSE than site B: 0.262
Probability site A is BETTER than site B: 0.738
N_A = 120 N_B = 56
Probability site A is WORSE than site B: 0.529
Probability site A is BETTER than site B: 0.471
N_A = 405 N_B = 189
Probability site A is WORSE than site B: 0.851
Probability site A is BETTER than site B: 0.149
N_A = 960 N_B = 448
Probability site A is WORSE than site B: 0.251
Probability site A is BETTER than site B: 0.749
N_A = 1875 N_B = 875
Probability site A is WORSE than site B: 0.916
Probability site A is BETTER than site B: 0.084
N_A = 3240 N_B = 1512
Probability site A is WORSE than site B: 0.041
Probability site A is BETTER than site B: 0.959
N_A = 5145 N_B = 2401
Probability site A is WORSE than site B: 0.019
Probability site A is BETTER than site B: 0.981
N_A = 7680 N_B = 3584
Probability site A is WORSE than site B: 0.000
Probability site A is BETTER than site B: 1.000
N_A = 10935 N_B = 5103
Probability site A is WORSE than site B: 0.000
Probability site A is BETTER than site B: 1.000

I’m particularly concerned that, in this run, when N is around a thousand the answer is wrong, and “confidently” wrong: nearly 90% in favour of the false hypothesis.

N_A = 1875 N_B = 875
Probability site A is WORSE than site B: 0.916
Probability site A is BETTER than site B: 0.084

I know there is a concept called chains that might play a role here (posterior variability), and I may have mistakenly reduced it to 1 in this implementation. Is that an error? Does the Metropolis sampling algorithm play a role here too? Should the prior distribution (Uniform) be defined differently? Is there something else I might be misunderstanding or misinterpreting? Is this not the right way to run a hypothesis test? Am I confusing “confidence” in the Bayesian framework with the equivalent concept in frequentist hypothesis testing?

Thanks in advance!!! Any help is appreciated! :slight_smile:

Welcome!

There are a couple of things going on here.

One thing I can see is that you are dichotomizing the difference between the inferred p_A and p_B (i.e., you’re counting the number of samples for which delta_samples < 0, or equivalently, for which p_A < p_B). This throws out information about the magnitude of the inferred difference. I might suggest looking at the full posterior distribution of delta_samples to more fully interpret the result of your inference. If all you know is that the mean of delta_samples is 0.5, you have no idea whether the posterior is flat or a spike centered at 0.5.
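
For example, something along these lines (a minimal sketch, reusing the burned_trace from your loop; ArviZ is installed alongside PyMC):

import arviz as az
import matplotlib.pyplot as plt

# Plot the full posterior of delta rather than only its sign;
# ref_val=0 marks the "no difference" point on the plot.
az.plot_posterior(burned_trace, var_names=["delta"], ref_val=0)
plt.show()

# A quick numerical summary (mean, sd, HDI, diagnostics) of the same posterior.
print(az.summary(burned_trace, var_names=["delta"]))

That way you see both where the posterior is centred and how wide it is.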

The other thing is that you are stochastically generating data. So the probabilities you define at the top, true_p_A and true_p_B, are “correct”, but can yield data that ultimately suggests very different values of p_A and p_B. So what the “correct” answer is will depend on what you’re looking for. If you want to be faithful to the data, then true_p_A and true_p_B are largely irrelevant once the data is generated. You aren’t printing any summary of the generated data, but it’s important to keep it in mind.
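
As a quick illustration (a sketch reusing the variable names from your loop), you can compare the realized conversion rates in each simulated dataset to the “true” ones:

# Empirical rates in the simulated data; these are what the model actually sees.
# For small-to-moderate N they can differ noticeably from true_p_A / true_p_B,
# and can even come out in the opposite order.
print(f"observed rate A: {observations_A.mean():.4f} (true {true_p_A})")
print(f"observed rate B: {observations_B.mean():.4f} (true {true_p_B})")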


Thanks for the fast reply @cluhmann! :slight_smile:

I might suggest looking at the full posterior distribution of delta_samples to more fully interpret the result of your inference.

Ohhh I see. I followed the book example, but it makes way more sense to look at the posterior distribution of delta. How do you run a hypothesis test using that distribution, something like P(p_A > p_B)? Is it just a matter of estimating the area of the distribution where delta > 0? What is the “correct” way of testing a hypothesis?

but can yield data that ultimately suggests very different values of p_A and p_B

Yes, that’s correct, but shouldn’t that be irrelevant when the samples are on the order of a thousand?

then true_p_A and true_p_B are largely irrelevant once the data is generated.

I assume that’s because the sampled p_A might be different from true_p_A. I agree that that’s probably the case when the sample is not very large, although I would expect p_A ≈ true_p_A when the sample size is “large enough”. But yes, you are right: it might happen that p_A < p_B for a particular random sample.
Thanks again for the help :slight_smile:

You are currently running a type of hypothesis test, but there are others. How you use a full distribution very much depends on what you are trying to do. There is no single, universal way to throw away information (i.e., to go from a full distribution to a single number). You can check the Wikipedia article on Bayesian estimators for more info.
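
As a sketch of a few common reductions (reusing delta_samples from your loop above):

import numpy as np

# A few common ways to reduce the posterior of delta to a number or an interval:
print("posterior mean:  ", np.mean(delta_samples))                 # point estimate under squared-error loss
print("posterior median:", np.median(delta_samples))               # point estimate under absolute loss
print("P(delta > 0):    ", np.mean(delta_samples > 0))             # the tail probability you already compute
print("94% interval:    ", np.percentile(delta_samples, [3, 97]))  # equal-tailed credible interval

Each of these answers a slightly different question, which is why there is no single “the” test.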

Should it be? Probably. Will it definitely be irrelevant? No. I can flip a fair coin a thousand times and have it come up heads every single time. If that happens, I would hope my inference reflects the data and not the “true” value of θ = 0.5.
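
For a rough sense of scale (a back-of-the-envelope sketch using the true rates from your example and the N_A = 1875 / N_B = 875 case you highlighted):

import numpy as np

# Approximate standard error of the observed difference in conversion rates.
p_a, p_b = 0.05, 0.04
n_a, n_b = 1875, 875
se_delta = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
print(f"true delta = {p_a - p_b:.3f}, SE of observed delta ≈ {se_delta:.3f}")
# true delta = 0.010, SE of observed delta ≈ 0.008

The sampling noise in the observed difference is about the same size as the true difference, so a dataset of this size in which B appears to beat A is not at all unlikely.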

Thanks again for the reply!

How you use a full distribution very much depends on what you are trying to do

In this particular example I was trying to test a difference in means, H0: p_A <= p_B, similar to a t-test. Is there a way in PyMC to integrate the posterior? (Not sure if that’s the way to proceed; I read the Wikipedia article, but I might be missing something.)

Thanks again for answering all the questions! (a noob learning here :slight_smile: )

Yup, very reasonable! The problem, as I mentioned, is that there is no single, unique way of doing this. I would suggest checking out procedures such as BEST (aka “Bayesian estimation supersedes the t test”) and Bayes factors, as well as things like Andrew Gelman’s discussion of “type S” and “type M” errors. At that point you’ll either have found something that meets your needs, know enough to construct your own “test”, or be utterly confused (but hopefully one of the first two).
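
As one concrete illustration in the BEST/ROPE spirit (a sketch, not the one “correct” test; the ROPE bounds below are arbitrary and should reflect whatever difference in conversion rate you would consider negligible):

import arviz as az
import matplotlib.pyplot as plt

# Summarize the posterior of delta against a region of practical equivalence (ROPE).
az.plot_posterior(
    burned_trace,
    var_names=["delta"],
    ref_val=0,               # mark "no difference"
    rope=(-0.005, 0.005),    # illustrative region treated as practically equivalent to zero
    hdi_prob=0.94,
)
plt.show()

If most of the posterior mass for delta sits outside the ROPE and on one side of zero, that is a fairly direct answer to the question you were asking with the t-test analogy.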

I would also suggest attending one of the PyMC office hours we regularly hold. Future dates will be posted on the PyMC Meetup page.


Thanks a lot!!! I will take a look! Thanks for all the help :raised_hands::raised_hands::raised_hands:
Edit: BEST pymc
