Strategies for speeding up power analyses

I’m using the following pseudocode strategy for power analyses with pymc3

nsim = 1000
mu_diff = 0.1

model = compile_model()
res = np.zeros(nsim)

for i in range(nsim):
    A, B = simulate_data(mu_diff)

    with model:
        pm.set_data({"dataA": A, "dataB": B})
        trace = pm.sample()

    res[i] = test_for_difference(trace)

print("power: ", res.mean())

Is there any way to take this a step further on just a local setup? Is it even possible to fit multiple models at the same time on a local machine? I’ve already started pushing the limits in terms of reducing the number of chains, samples, and tuning steps.

What you are asking for is beyond the scope of PyMC3, in my opinion.

The simplest way to run multiple jobs in parallel in Python is the built-in threading module, but that won’t help much with anything computational because the GIL keeps Python threads from executing CPU-bound work in parallel. The multiprocessing module has a similar interface but uses separate processes, so it can use more than one core. I assume that neither approach is recommended in conjunction with PyMC3, since parallelization is already built into sampling, but I’ve never actually tried it.
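If you want to experiment anyway, a rough, untested sketch of the multiprocessing version might look like the following. It reuses the (hypothetical) simulate_data, compile_model, and test_for_difference helpers from your pseudocode and sets cores=1 inside each worker so PyMC3’s own parallelism doesn’t compete with the pool.

# Rough, untested sketch of running independent fits with multiprocessing.Pool.
# simulate_data, compile_model, and test_for_difference are the hypothetical
# helpers from the pseudocode above.
from multiprocessing import Pool
import numpy as np
import pymc3 as pm

def run_one_simulation(seed):
    np.random.seed(seed)
    A, B = simulate_data(0.1)
    model = compile_model()            # build everything inside the worker
    with model:
        pm.set_data({"dataA": A, "dataB": B})
        # cores=1 so the chains within a worker don't spawn more processes
        trace = pm.sample(cores=1, progressbar=False)
    return test_for_difference(trace)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        res = pool.map(run_one_simulation, range(1000))
    print("power:", np.mean(res))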

I can’t imagine why you would want to perform power analysis—which I understand to be a frequentist concept related to the probability of accepting/rejecting the null hypothesis via p-values in simulated data—in PyMC3 in the first place. Unless “power” refers to something else here?

The problem is I have nsim=1000 different datasets, and from my understanding threading implies fitting a model to the same dataset… although maybe I could put all 1000 datasets into one dataframe, index them, and give the model variables a shape of length nsim. I might try that and see if things speed up.

In terms of power analyses for Bayesian approaches: in my (very limited) experience, NHST is operationally quick and pretty clear-cut for decision making and figuring out timelines in industry. I have yet to have someone show me a better approach apart from bandit algorithms (though I’d love to see one).

I’ve also found it’s difficult to find the “right” priors due to seasonality and co-interventions, so using an NHST approach with a Bayesian model feels like a slightly more conservative approach that matches my experience level.

Threading just means running the iterations of a loop asynchronously rather than in order. You can thread anything, including the same analysis of different data or different analyses of the same data. As I said though, I don’t think threading is going to help you much since your jobs are computational. Multiprocessing might help, but again, PyMC3 already does multiprocessing under the hood.

Based on what you wrote here, I’ve no clue what you are trying to accomplish.

I’m not sure what your exact use case is, but NHST doesn’t sound like it’s a good fit for what you’re looking for. Usually, if you’re having difficulty setting priors, the best response is to use a hierarchical model. A hierarchical model will learn its priors from the data. If hierarchical priors don’t help, and your priors still have a lot of influence on the outcome, this usually indicates that you don’t have enough data; NHST won’t help with that. What exactly are you trying to model here, and can I take a look at your model?
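By a hierarchical model I mean something along these lines (an illustrative sketch with made-up numbers, not your actual model), where the group-level rates get hyperpriors that are learned from the data:

# Minimal illustrative sketch of a hierarchical prior on conversion rates.
# All numbers here are made up for the example.
import pymc3 as pm

with pm.Model() as hierarchical:
    # population-level ("hyper") priors, learned from the data
    mu = pm.Beta("mu", 2, 2)                       # overall conversion rate
    kappa = pm.Gamma("kappa", alpha=2, beta=0.1)   # concentration (how similar groups are)

    # group-level rates (e.g. one per variant/segment) shrink toward mu
    p = pm.Beta("p", alpha=mu * kappa, beta=(1 - mu) * kappa, shape=2)

    obs = pm.Binomial("obs", n=1000, p=p, observed=[150, 160])
    trace = pm.sample(return_inferencedata=True)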

As for being clear-cut, this is a major problem with NHST, not an advantage. With NHST, everything gets rounded off to either “Accept the null hypothesis” or “Reject the null hypothesis.” This is bad practice: p-values of .049 and .051 are not meaningfully different from each other.

With all that said, here are Bayesian summary statistics that might be useful for what you’re looking to do.

  1. Signal-to-noise ratio. The signal-to-noise ratio is equal to the expectation of the squared effect size, divided by the variance in your estimate. A signal-to-noise ratio lets you estimate how much of the variance in your estimates will be caused by your estimates improving over time, and how much will be caused by random chance. For instance, a signal-to-noise ratio of 3:1 indicates that 25% of the variance in your estimates will be explained by random chance.
  2. Pick an effect size that you consider “meaningful,” i.e. one where you care about any effect at least that large. Then calculate the posterior probability that the true effect is larger than that threshold (see the sketch after this list). If that probability is very small, you can safely say you’ve ruled out meaningful effect sizes. (Or, alternatively, calculate the probability that the real effect is smaller than this threshold, and then see whether you can confidently say the effect size is meaningfully large.)
  3. Bayesian p-values and u-values are good for assessing model fit and prior fit. Extreme p-values (far from 0.5) indicate that your priors and model have a bad fit to the data. If all your p-values are close to 0.5, don’t worry too much about your priors. (Make sure to calculate the p-values based on several statistics of your data. Skew and kurtosis are the two classics, but I also suggest something like the ELPPD that can take into account holistic model fit.)
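To make (2) concrete, here is a minimal sketch. The posterior draws below are stand-ins (in practice you’d pull them out of your trace), and the 1 percentage point threshold is just an example of a “meaningful” effect size:

import numpy as np

# Hypothetical posterior draws of the conversion rates for A and B;
# in practice these would come from your trace, not np.random.
p_A_samples = np.random.beta(30, 170, size=4000)
p_B_samples = np.random.beta(33, 167, size=4000)

meaningful = 0.01                       # smallest lift we care about (1pp)
diff = p_B_samples - p_A_samples

prob_meaningful = (diff > meaningful).mean()          # P(lift is meaningfully large)
prob_negligible = (np.abs(diff) < meaningful).mean()  # P(effect is negligible either way)

print("P(lift > 1pp):", prob_meaningful)
print("P(|effect| < 1pp):", prob_negligible)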

Thank you both for your responses so far!

Let’s say it’s a simple A/B test. Example: what’s the lift of some variant B over the original experience A (control)? How much time needs to pass for there to be enough observations to have an 80% probability of detecting a 5% relative effect in our metric of interest?

Would hierarchical models really work if there were only 2 contexts though? (variant A vs. variant B)

I definitely agree with those points you made, and the Bayesian summary statistics are a really helpful step in the right direction

The approach I typically use (shown below) I assume does a slightly better job than typical frequentist NHST because the “p-value” is more meaningful: calculating P(B > A | data, priors) from a Bayesian posterior is more interpretable than a frequentist p-value.

Taking a step back, in general I’d say the question “how long/how many trials would it take to find some meaningful difference between control/variant that we’re x% sure of?” is a fair and valuable question in any domain (frequentist or bayesian). I don’t quite understand why a Bayesian approach would decide this isn’t a meaningful question (am I misinterpreting here)? Is there a better way to answer this question in a Bayesian context than the typical power analysis?

Here’s a full example of how I approach a power analysis now (I originally used the format mentioned in the first post, but just came up with this idea, which seems to be faster). This is a simple example; typically I’d have more parameters / user traits and might have a regression in there. Note: I avoided aggregating this into counts of the form Binomial(n, p) and instead chose Binomial(1, p), since typically I’ll have other user-level data, and this more accurately represents the speed I’d see when simulating power. This takes hours once I scale it up to the 1000 simulated datasets I’d like, which is the problem I’m trying to solve.

import pandas as pd
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt


# Aggregated (counts) version, kept for reference:
# def generate_data(n, p, p_diff=0.01):
#     A = np.random.binomial(n, p, size=1)
#     B = np.random.binomial(n, p+p_diff, size=1)
#     return A, B
#
# def create_data(n, p, p_diff=0.01, n_datasets=100):
#     convs = np.squeeze([generate_data(n, p, p_diff) for i in range(n_datasets)])
#     return pd.DataFrame({"idx": range(n_datasets), "A": convs[:, 0], "B": convs[:, 1], "N": n})\
#                .set_index(["idx", "N"])\
#                .stack().reset_index()\
#                .rename(columns={"level_2": "variant", 0: "convert"})


def generate_data_user_level(n, p, p_diff=0.01):
    A = np.random.binomial(1, p, size=n)
    B = np.random.binomial(1, p+p_diff, size=n)
    return A, B

def create_data_user_level(n, p, p_diff=0.01, n_datasets=100):
    convs = np.concatenate([generate_data_user_level(n, p, p_diff) for i in range(n_datasets)], axis=-1).T
    return pd.DataFrame({"idx": np.repeat(range(n_datasets), n), "A": convs[:, 0], "B": convs[:, 1], "N": 1})\
               .set_index(["idx", "N"])\
               .stack().reset_index()\
               .rename(columns={"level_2": "variant", 0: "convert"})

# Parameters
n_datasets = 100
n = 1000
a_prior = 15
b_prior = 85
alpha = 0.1 

# Simulating data
data = create_data_user_level(n, 0.15, 0.01, n_datasets=n_datasets)

# Gathering data
variant_, variant_labels = pd.factorize(data.variant)
dataset_ = data.idx.values
N_variant = len(variant_labels)
print( data.head() )

# Model
with pm.Model() as m1:
    p = pm.Beta("p", a_prior, b_prior, shape = (n_datasets, N_variant))
    conv = pm.Binomial("conv", data.N.values, p[dataset_, variant_], observed=data.convert.values)
    trace = pm.sample(return_inferencedata=True)
    
# Evaluating Power
p_ = trace.posterior["p"].values.reshape(-1, n_datasets, 2)
A = p_[:, :, 0]
B = p_[:, :, 1]

# Modeling successes/failures with a beta distribution here would make more
# sense so I'd have a distribution of Power, but this is a quick example.
# Declare a "detection" when P(B > A | data) exceeds 1 - alpha.
power = ((B > A).mean(axis=0) > 1 - alpha).mean()
print("Power: {:.3f}".format(power))

I’m coming in late to the thread, and have only really read the most recent post. I agree with you, Kyle, that power analysis is perfectly well-motivated in a Bayesian setting. One could re-cast it as “how many samples do I need for the posterior variance of the estimate of the difference in rates/means to be smaller than $\theta$?” – and then note that for a fixed difference $\Delta_\mu$ this is equivalent to a statement about p-values.
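To spell out that equivalence (a sketch under a normal approximation to the posterior, with posterior mean $\Delta_\mu$ and posterior standard deviation $\sigma$):

$$
P(\mu_B > \mu_A \mid \text{data}) = \Phi\!\left(\frac{\Delta_\mu}{\sigma}\right) > 1 - \alpha
\quad\Longleftrightarrow\quad
\sigma^2 < \left(\frac{\Delta_\mu}{z_{1-\alpha}}\right)^2,
$$

so “power to detect a difference $\Delta_\mu$ at level $\alpha$” is the same statement as “the posterior variance is below the threshold $\theta = (\Delta_\mu / z_{1-\alpha})^2$.”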

As far as speeding up the model – even though the models are independent, using NUTS to sample from a (1000, 1000) space is (I suspect) slower than sampling from a (100, 100) space 100 times.

In this specific case, where you have used a conjugate pair (beta/binomial), the posterior has a closed-form solution. I’d recommend using pm.Deterministic to compute the posterior parameters and pass them into posterior = pm.Beta(...); that way you can just use pm.sample_prior_predictive and forgo NUTS entirely.
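Something like this untested sketch, reusing data, a_prior, b_prior, n_datasets, and alpha from your example (since the posterior parameters here are plain arrays, you don’t even strictly need pm.Deterministic):

# Untested sketch of the conjugate shortcut. The Beta-Binomial posterior is
# Beta(a_prior + successes, b_prior + failures), so we can plug those
# parameters in directly and sample with sample_prior_predictive, no NUTS.
agg = data.groupby(["idx", "variant"])["convert"].agg(["sum", "count"])
successes = agg["sum"].values.reshape(n_datasets, 2)
failures = (agg["count"] - agg["sum"]).values.reshape(n_datasets, 2)

with pm.Model() as conjugate_model:
    p = pm.Beta("p", a_prior + successes, b_prior + failures,
                shape=(n_datasets, 2))
    draws = pm.sample_prior_predictive(samples=4000)

p_ = draws["p"]                      # shape (4000, n_datasets, 2)
power = ((p_[:, :, 1] > p_[:, :, 0]).mean(axis=0) > 1 - alpha).mean()
print("Power: {:.3f}".format(power))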

The question you’re asking is well-defined, in that you can calculate an answer and it will mean something. However, if you’re doing A/B testing on a website or something similar and want to figure out how big a sample size you’ll need, I would actually suggest taking advantage of stopping-rule independence. Bayesian statistics don’t actually require that the sample size be fixed ahead of time, so you can just collect data until your estimates have the necessary precision. If you collect 1000 samples and you find that’s not enough for your purposes (e.g. the precision doesn’t let you rule out some meaningfully large effect sizes) you can just keep collecting more data until you have enough information to be able to make a decision.
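As a rough sketch of what that can look like for a conversion-rate test (collect_batch is a hypothetical stand-in for gathering real observations, and the closed-form Beta updates assume the simple binomial model from earlier in the thread):

# Rough sketch of "collect until the posterior is precise enough".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def collect_batch(size=500):
    # stand-in for real data collection: pretend the true rates are 15% and 16%
    return rng.binomial(1, 0.15, size), rng.binomial(1, 0.16, size)

a_A = b_A = a_B = b_B = 1.0           # Beta(1, 1) priors
max_width = 0.02                      # stop when the 95% interval on the lift is this narrow

for batch in range(100):
    obs_A, obs_B = collect_batch()
    a_A += obs_A.sum(); b_A += len(obs_A) - obs_A.sum()
    a_B += obs_B.sum(); b_B += len(obs_B) - obs_B.sum()

    # closed-form Beta posteriors; sample the lift and check its precision
    diff = stats.beta.rvs(a_B, b_B, size=4000) - stats.beta.rvs(a_A, b_A, size=4000)
    lo, hi = np.percentile(diff, [2.5, 97.5])
    if hi - lo < max_width:
        print(f"stopping after {batch + 1} batches; P(B > A) = {(diff > 0).mean():.3f}")
        break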

This might be a bit counterintuitive if you’re used to frequentist statistics, so here’s a good way of explaining stopping rule independence: What’s the probability of getting 10 successes in 60 trials if you weren’t planning to keep going? Now, what’s the probability of getting 10 successes in 60 trials if you were planning to keep going? As long as your plans don’t affect the likelihood function (e.g. you’re not intentionally trying to take measurements only when a lot of people are visiting your website) they don’t affect the probability.

To clarify my position, I wasn’t saying it doesn’t make sense to do power analysis in a Bayesian context, just that I had never heard it called that. I’ve only ever seen the term used in grant applications, etc., where the goal is to figure out how much data you need to achieve sufficient power or how sensitive your study will be given the budget … in my experience this has always been strictly frequentist. I would call what you’re describing a simulation study or a parameter-recovery experiment … which I guess is just splitting hairs.

I think @chartl is correct about figuring out the posterior of p in closed form, then sampling from that distribution directly. I don’t think you need PyMC3 at all for this. For example, once you know the posterior Beta parameters you can generate posterior samples of p with scipy.stats.beta.rvs (and simulate the data itself with scipy.stats.binom.rvs).
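For instance, here is a sketch of the whole power loop in NumPy/SciPy, mirroring the parameters from Kyle’s example; because everything is vectorized and there’s no MCMC, 1000 simulated datasets should take seconds rather than hours:

# Sketch of the full power calculation without PyMC3: simulate binomial data,
# form the closed-form Beta posteriors, and sample them directly with scipy.
import numpy as np
from scipy import stats

n_datasets, n = 1000, 1000
p_A, p_diff = 0.15, 0.01
a_prior, b_prior, alpha = 15, 85, 0.1

conv_A = stats.binom.rvs(n, p_A, size=n_datasets)            # conversions per simulated dataset
conv_B = stats.binom.rvs(n, p_A + p_diff, size=n_datasets)

# conjugate Beta posteriors, sampled for every dataset at once
post_A = stats.beta.rvs(a_prior + conv_A, b_prior + n - conv_A, size=(4000, n_datasets))
post_B = stats.beta.rvs(a_prior + conv_B, b_prior + n - conv_B, size=(4000, n_datasets))

prob_b_better = (post_B > post_A).mean(axis=0)               # P(B > A | data) per dataset
power = (prob_b_better > 1 - alpha).mean()
print("Power: {:.3f}".format(power))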