Bayesian sample size estimation for given HPD

dovgalec · July 25, 2017, 3:38am

I am trying to come up with a model for Bayesian sample size estimation given a HDI width at 95% HPD, and a small data sample. Basically, this is power analysis but in Bayesian framework.

The background problem: How many data examples are necessary to estimate model performances with estimation falling in 95% HPD with width 0.2? A sample of data is given.

The data can be presented in form of a binary vector of 1s and 0s: 1 - correctly classified example, 0 - incorrectly classified. The performance metric is accuracy - number of correctly classified examples (sum of data vector divided by length).

I can use that sample in Beta-Binomial model where the parameter “p” is the success rate or the accuracy. I don’t know how to express and tie in the given confidence interval width.

What I am interested in, is to obtain a probability distribution over sample sizes “n”. I would expect the probability to be low for small “n” and approach 1 as the sample size increases. Naturally one obtains increasingly narrower HPD as more data is given.

Am I missing some essential reading? Is this model well-defined?

junpenglao · July 25, 2017, 9:23am

If this is the aim, you can model sample size “n” as a parameter of a linear model: observed_accuracy = f(sample size “n”). It will need data input as pair of {sample size, accuracy}. And you just place prior on “n” and maybe also the intercept of the linear model.

dovgalec · July 25, 2017, 2:55pm

Thanks! This is an interesting idea and fairly easy to implement and understand! The only concern is that the process can be non-linear. May be try a GP here?

The actual goal, after discussing with a colleague, is to estimate the smallest dataset size at which the 95% HDI width of the estimate is certain value, say, 0.2. A sample of data (a vector of 1s and 0s) is given and can be used to “bootstrap” the model.

The Ch. 13 of John Kruschke’s “puppy book” looks very relevant. I read it fairly quickly and didn’t understand it enough to apply on my data immediately. Osvaldo Martin kindly implemented some of the code in Python. Need to look more into how to use it.

github.com

aloctavodia/Doing_bayesian_data_analysis/blob/master/13_minNforHDIpower.py

"""
The program described in this section was used to generate Tables 13.1 and 13.2.
The program determines the minimal sample size needed to achieve a specified 
goal with a specified power, when flipping a single coin.
"""
import numpy as np
from HDIofICDF import *
from scipy.special import binom, betaln


def minNforHDIpower(genPriorMean, genPriorN, HDImaxwid=None, nullVal=None,
                    ROPE=None, desiredPower=0.8, audPriorMean=0.5,
                    audPriorN=2, HDImass=0.95, initSampSize=1, verbose=True):
    if HDImaxwid != None  and nullVal != None:
        sys.exit('One and only one of HDImaxwid and nullVal must be specified')
    if ROPE == None:
        ROPE = [nullVal, nullVal]
   # Convert prior mean and N to a, b parameter values of beta distribution.
    genPriorA = genPriorMean * genPriorN
    genPriorB = (1.0 - genPriorMean) * genPriorN

This file has been truncated. show original

junpenglao · July 25, 2017, 6:49pm

I would say simulated a few data point (eg fit 10, 100, 1000 data points etc and get the width of 95%HDI), and fit a linear or log-linear model.

dovgalec · July 25, 2017, 9:27pm

Yeah, I ended up going sampling way. I used Beta-Binomial model to estimate the success rate and then used the parameter estimate to generate increasingly larger synthetic datasets. From each such dataset, I estimated the parameter and especially the 95% HDI width.

For Beta-Binomial model, I wonder whether I can quickly compute the confidence interval? The Beta distribution is conjugate to Binomial, and the posterior parameters can be obtained w/o any sampling.

dovgalec · July 25, 2017, 9:35pm

Here is the code. Please let me know if something is wrong or could be done more efficiently:

import sys

import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt

from scipy.stats import bernoulli

p_true = 0.60       # hidden success rate
n = 100             # sample size

# synthetic dataset
X = bernoulli.rvs(p_true, size=n)
print(X)

print( "*** BOOTSTRAPPING ***" )

model = pm.Model()

with model:
    p = pm.Beta('p', alpha=2, beta=2)
    y_obs = pm.Binomial('y_obs', p=p, n=n, observed=np.sum(X))
    step = pm.Metropolis()
    trace = pm.sample(10000, step=step, progressbar=False)

interval = pm.hpd(trace, 0.05)
interval_width = interval['p'][1] - interval['p'][0]
print("Interval width: %f" % interval_width)

p_hat = np.mean(trace["p"])
print("Estimated success rate: %f" % p_hat)

#mu_p = p_hat
#sd_p = np.std(trace["p"])

#pm.traceplot(trace)
#pm.plot_posterior(trace)
#plt.show()
#sys.exit(0)

print("\n\nIterations")

s = 100
sample_sizes = list()
avg_interval_widths = list()

while s < 1000:

    s = s + 50
    sample_sizes.append(s)

    interval_widths = list()

    for i in range(5):
        # generate data
        x = bernoulli.rvs(p_hat, size=s)

        with pm.Model():
            p = pm.Beta('p', alpha=2, beta=2)
            y_obs = pm.Binomial('y_obs', p=p, n=s, observed=np.sum(x))
            step = pm.Metropolis()
            trace = pm.sample(10000, step=step, progressbar=False)

        interval = pm.hpd(trace, 0.05)
        interval_width = interval['p'][1] - interval['p'][0]
        interval_widths.append(interval_width)
        print("\tInterval width: %f" % interval_width)

        p_hat = np.mean(trace["p"])
        print("\tEstimated success rate: %f" % p_hat)

        print()

    avg_interval_width = sum(interval_widths) / len(interval_widths)
    avg_interval_widths.append(avg_interval_width)

    print("\n*** SAMPLE SIZE: %d ***" % s)
    print("Average interval width: %f" % avg_interval_width)


fig = plt.figure()
plt.plot( np.asarray(sample_sizes), np.asarray(avg_interval_widths) )
plt.show()

junpenglao · July 26, 2017, 7:21am

I would say you don’t need a PyMC3 model in this case, just using the formula to update the posterior is enough.

But of course, have the PyMC3 model you can easily extend it to other more complicated model (which might not have conjugate prior and difficult to work out in math), for that, a few tips:

You don’t need to repeatedly define your model (might be slow to always compile the same model especially if it is a large one). For example, you can do the following:

...
# define your model
n = theano.shared(100, dtype=int))
X = theano.shared(50, dtype=float))
with model:
    p = pm.Beta('p', alpha=2, beta=2)
    y_obs = pm.Binomial('y_obs', p=p, n=n, observed=X)

sample_size = [10, 100, 1000, 10000]
# for loop
for s in sample_size:
    n.set_value(s)
    X.set_value(np.sum(new_observed_X))
    with model:
        trace = pm.sample()
    # compute HPD.

dovgalec · July 26, 2017, 4:27pm

Thanks! I have two questions:
(1) Yes, I can use the conjugacy property of Beta and Binomial. How do I find the 95% HDI interval width for the posterior in this setting? I suspect that I miss some basics here.
(2) Using theano.shared() is the way to go! Does this work if the X is an array which changes size from loop to loop?

junpenglao · July 26, 2017, 5:05pm

(1) you need to work out the inverse CDF of Beta distribution and compute inverseCDF_Beta(.975) - inverseCDF_Beta(.025)
(2) The X does not change size as you are setting the value using np.sum(new_observed_X)

KyleJCaron · August 21, 2020, 9:48pm

In this case what is new_observed_X? Is it a random variable computed by something like np.random.binomial(1,p_true, size=s)? Or is it simply p_true*s (where s is sample size)?

Should there be repeated simulation or is this really a sufficient method? And if so, why?

isosphere · August 5, 2021, 2:44am

I’ve updated @dovgalec 's code to work with current versions, and incorporated some of the suggestions above. Hopefully it’s useful to someone else who finds themselves here at the end of a Google search.

I’m using pm.Data instead of theano.shared; I couldn’t get the latter to work.

import arviz as az
import logging
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli

logger = logging.getLogger('pymc3')
logger.setLevel(logging.ERROR)

p_true = 0.60       # hidden success rate

with pm.Model() as model:
    # initialized with bootstrap values
    n = pm.Data('n', 100)
    data = pm.Data('data', bernoulli.rvs(p_true, size=100).sum())    
    
    p = pm.Beta('p', alpha=2, beta=2)
    y_obs = pm.Binomial('y_obs', p=p, n=n, observed=data)
    
# Bootstrap initial p_hat
with model:
    trace = pm.sample(10000, progressbar=False, return_inferencedata=True)
    interval = az.hdi(trace, 0.05)
    interval_width = interval['p'][1] - interval['p'][0]
    summary = az.summary(trace)
    p_hat = summary["mean"]["p"]

sample_size = [10, 100, 1000, 10000]
observed_data =[bernoulli.rvs(p_hat, size=s).sum() for s in sample_size]

# Iterate over samples
interval_widths = list()
for i in range(len(sample_size)):
    with model:
        pm.set_data({'data': observed_data[i], 'n': sample_size[i]})
        trace = pm.sample(progressbar=False, return_inferencedata=True)
        
        interval = az.hdi(trace, 0.05)
        interval_width = interval['p'][1] - interval['p'][0]
        interval_widths.append(float(interval_width))
    
fig = plt.figure()
plt.plot( np.asarray(sample_size), np.asarray(interval_widths) )
plt.yscale("log")
plt.title("Credible Interval Precision Curve")
plt.ylabel("HDI Width")
plt.xlabel("Sample Size")
plt.show()

Topic		Replies	Views
Bayesian AB Test Duration and Sample Size version agnostic	3	383	January 18, 2024
Sensible prior for logistic regression with one or no input variable Questions	2	462	December 5, 2021
Estimating prior sd for the parameter p of a beta binomial regression Questions	6	470	March 26, 2020
Computing Prediction Intervals without sampling Questions	4	1036	May 16, 2018
Significant digits of pm.stats.hpd Questions	1	403	December 15, 2020

Bayesian sample size estimation for given HPD

Related topics