Thank you both for your responses so far!
Let’s say it’s a simple A/B test. Example: what’s the lift of some variant B over the original experience A (the control)? How much time needs to pass for there to be enough observations to have an 80% probability of detecting a 5% relative effect in our metric of interest?
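For reference, the frequentist version of that question has a textbook answer. A minimal sketch with statsmodels (the 15% baseline conversion rate is just an illustrative assumption, matching the simulation below):

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative baseline rate and a 5% relative lift (both assumptions)
p_a = 0.15
p_b = p_a * 1.05

# Cohen's h effect size for two proportions
effect_size = proportion_effectsize(p_b, p_a)

# Observations needed per variant for 80% power at a 5% significance level
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=1.0
)
print("~{:.0f} users per variant".format(np.ceil(n_per_variant)))
```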
Would hierarchical models really work if there were only 2 contexts (variant A vs. variant B), though?
I definitely agree with those points you made, and the Bayesian summary statistics are a really helpful step in the right direction.
The approach I typically use (shown below) does, I think, a slightly better job than the typical frequentist NHST, because the “p-value” it produces is more meaningful: P(B > A | data, priors) computed from a Bayesian posterior is easier to interpret than a frequentist p-value.
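Concretely, once you have posterior draws of each variant’s conversion rate, that probability is just the fraction of draws in which B beats A. A minimal sketch with made-up draws standing in for a real trace:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for posterior draws of p_A and p_B (in practice, pulled from the PyMC3 trace)
p_a_draws = rng.beta(150 + 15, 850 + 85, size=4000)
p_b_draws = rng.beta(165 + 15, 835 + 85, size=4000)

# P(B > A | data, priors): fraction of joint posterior draws where B beats A
print("P(B > A | data, priors) = {:.3f}".format((p_b_draws > p_a_draws).mean()))
```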
Taking a step back, I’d say the question "how long / how many trials would it take to find some meaningful difference between control and variant that we’re x% sure of?" is a fair and valuable question in any domain, frequentist or Bayesian. I don’t quite understand why a Bayesian approach would decide this isn’t a meaningful question (am I misinterpreting here?). Is there a better way to answer it in a Bayesian context than the typical power analysis?
Here’s a full example of how I approach a power analysis now (I originally used the format mentioned in the first post, but just came up with this idea, which seems to be faster). This is a simple example; typically I’d have more parameters / user traits and might have a regression in there. Note: I avoided aggregating the data into counts of the form Binomial(n, p) and instead chose Binomial(1, p), since typically I’ll have other user-level data, so this more accurately represents the speed I’d see when simulating power. With the 1000 simulated datasets I’d actually like to run, this takes hours, and that’s the problem I’d like to solve.
import pandas as pd
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
# Earlier aggregated version (counts per dataset), kept for reference:
# def generate_data(n, p, p_diff=0.01):
#     A = np.random.binomial(n, p, size=1)
#     B = np.random.binomial(n, p+p_diff, size=1)
#     return A, B
# def create_data(n, p, p_diff=0.01, n_datasets=100):
#     convs = np.squeeze([generate_data(n, p, p_diff) for i in range(n_datasets)])
#     return pd.DataFrame({"idx": range(n_datasets), "A": convs[:, 0], "B": convs[:, 1], "N": n})\
#         .set_index(["idx", "N"])\
#         .stack().reset_index()\
#         .rename(columns={"level_2": "variant", 0: "convert"})
def generate_data_user_level(n, p, p_diff=0.01):
    # One Bernoulli conversion flag per user for each variant
    A = np.random.binomial(1, p, size=n)
    B = np.random.binomial(1, p + p_diff, size=n)
    return A, B

def create_data_user_level(n, p, p_diff=0.01, n_datasets=100):
    # Stack the simulated datasets into long format: one row per user per variant
    convs = np.concatenate([generate_data_user_level(n, p, p_diff) for i in range(n_datasets)], axis=-1).T
    return pd.DataFrame({"idx": np.repeat(range(n_datasets), n), "A": convs[:, 0], "B": convs[:, 1], "N": 1})\
        .set_index(["idx", "N"])\
        .stack().reset_index()\
        .rename(columns={"level_2": "variant", 0: "convert"})
# Parameters
n_datasets = 100  # number of simulated experiments
n = 1000          # users per variant per experiment
a_prior = 15      # Beta prior on the conversion rate (prior mean 0.15)
b_prior = 85
alpha = 0.1
# Simulate the data
data = create_data_user_level(n, 0.15, 0.01, n_datasets=n_datasets)
# Encode the variant / dataset indices
variant_, variant_labels = pd.factorize(data.variant)
dataset_ = data.idx.values
N_variant = len(variant_labels)
print(data.head())
# Model
with pm.Model() as m1:
    # Independent Beta prior on the conversion rate of each (dataset, variant) cell
    p = pm.Beta("p", a_prior, b_prior, shape=(n_datasets, N_variant))
    conv = pm.Binomial("conv", data.N.values, p[dataset_, variant_], observed=data.convert.values)
    trace = pm.sample(return_inferencedata=True)
# Evaluating power
# Collapse the chain and draw dimensions into a single sample dimension
p_ = trace.posterior["p"].values.reshape(-1, n_datasets, 2)
A = p_[:, :, 0]
B = p_[:, :, 1]
# Modeling the detection successes/failures with a beta distribution might make
# more sense here, so I'd have a distribution of power,
# but this is a quick example
power = ((B > A).mean(axis=0) > 1 - alpha).mean()  # share of datasets where P(B > A | data) > 1 - alpha
print("Power: {:.3f}".format(power))