I am trying to come up with a model for Bayesian sample size estimation given a target 95% HDI width and a small data sample. Basically, this is power analysis in a Bayesian framework.
The background problem: how many data examples are necessary to estimate a model's performance such that the estimate falls in a 95% HPD interval of width 0.2? A sample of data is given.
The data can be presented as a binary vector of 1s and 0s: 1 for a correctly classified example, 0 for an incorrectly classified one. The performance metric is accuracy: the number of correctly classified examples divided by the total, i.e. the sum of the data vector divided by its length.
I can use that sample in a Beta-Binomial model where the parameter “p” is the success rate, i.e. the accuracy. What I don’t know is how to express and tie in the given credible interval width.
What I am interested in is a probability distribution over sample sizes “n” — roughly, the probability that the 95% HPD width is at most 0.2 given “n”. I would expect that probability to be low for small “n” and to approach 1 as the sample size increases, since one naturally obtains an increasingly narrower HPD interval as more data is given.
Am I missing some essential reading? Is this model well-defined?
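(For concreteness, a minimal sketch of the Beta-Binomial model I have in mind — the binary vector and the Beta(2, 2) prior are just illustrative values:)

import numpy as np
import pymc3 as pm

# example binary vector: 1 = correctly classified, 0 = incorrectly classified
data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])

with pm.Model():
    p = pm.Beta('p', alpha=2, beta=2)  # prior on the accuracy
    pm.Binomial('y', n=len(data), p=p, observed=data.sum())
    trace = pm.sample(progressbar=False, return_inferencedata=True)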
If this is the aim, you can model the sample size “n” as a parameter of a linear model: observed_accuracy = f(sample size “n”). It will need data input as pairs of {sample size, accuracy}, and you just place a prior on “n” and maybe also on the intercept of the linear model.
Thanks! This is an interesting idea and fairly easy to implement and understand! The only concern is that the process can be non-linear. Maybe try a GP here?
The actual goal, after discussing with a colleague, is to estimate the smallest dataset size at which the 95% HDI width of the estimate is a certain value, say, 0.2. A sample of data (a vector of 1s and 0s) is given and can be used to “bootstrap” the model.
Ch. 13 of John Kruschke’s “puppy book” (Doing Bayesian Data Analysis) looks very relevant. I read it fairly quickly and didn’t understand it well enough to apply it to my data immediately. Osvaldo Martin kindly implemented some of the code in Python; I need to look more into how to use it.
Yeah, I ended up going the sampling way. I used a Beta-Binomial model to estimate the success rate and then used the parameter estimate to generate increasingly larger synthetic datasets. From each such dataset, I estimated the parameter and, in particular, the 95% HDI width.
For the Beta-Binomial model, I wonder whether I can compute the credible interval quickly? The Beta distribution is conjugate to the Binomial, so the posterior parameters can be obtained without any sampling.
I would say you don’t need a PyMC3 model in this case; just using the conjugate formula to update the posterior is enough.
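(For reference, the update is just posterior = Beta(alpha + successes, beta + failures); a minimal sketch with scipy, where the Beta(2, 2) prior and the counts are example values:)

from scipy.stats import beta

a_prior, b_prior = 2, 2  # Beta(2, 2) prior, as in the model below
n, s = 100, 60           # example data: 60 successes in 100 trials

# conjugate update: posterior is Beta(a + s, b + n - s)
posterior = beta(a_prior + s, b_prior + n - s)
print(posterior.mean())  # posterior mean of the accuracy p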
But of course, once you have the PyMC3 model you can easily extend it to other, more complicated models (which might not have a conjugate prior and may be difficult to work out mathematically). For that, a few tips:
You don’t need to repeatedly define your model (it might be slow to always re-compile the same model, especially if it is a large one). For example, you can do the following:
import numpy as np
import pymc3 as pm
import theano

# shared variables let us swap in new data without redefining the model
n = theano.shared(100)  # number of trials
X = theano.shared(50)   # number of successes

# define your model once
with pm.Model() as model:
    p = pm.Beta('p', alpha=2, beta=2)
    y_obs = pm.Binomial('y_obs', p=p, n=n, observed=X)

sample_size = [10, 100, 1000, 10000]
# loop over sample sizes, re-using the same model
for s in sample_size:
    n.set_value(s)
    X.set_value(np.sum(new_observed_X))  # new_observed_X: simulated 0/1 vector of length s
    with model:
        trace = pm.sample()
    # compute the HPD width from the trace here
Thanks! I have two questions:
(1) Yes, I can use the conjugacy of the Beta and Binomial. How do I find the 95% HDI width for the posterior in this setting? I suspect I’m missing some basics here.
(2) Using theano.shared() is the way to go! Does this work if X is an array whose size changes from loop to loop?
(1) You need the inverse CDF (quantile function) of the Beta posterior and compute inverseCDF_Beta(0.975) - inverseCDF_Beta(0.025). (Strictly speaking this is the equal-tailed 95% credible interval rather than the HDI, but for a reasonably symmetric Beta posterior the two are close; a sketch follows below.)
(2) X does not change size, since you are setting its value to the scalar np.sum(new_observed_X).
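A minimal sketch of that width computation with scipy (the prior and the counts are just example values):

from scipy.stats import beta

a_prior, b_prior = 2, 2
n, s = 100, 60  # example: 60 successes in 100 trials

a_post, b_post = a_prior + s, b_prior + (n - s)
# equal-tailed 95% credible interval via the Beta quantile function
width = beta.ppf(0.975, a_post, b_post) - beta.ppf(0.025, a_post, b_post)
print(width)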
In this case, what is new_observed_X? Is it a random vector generated by something like np.random.binomial(1, p_true, size=s)? Or is it simply p_true*s (where s is the sample size)?
Should there be repeated simulation, or is a single simulated dataset per sample size really sufficient? And if so, why?
I’ve updated @dovgalec’s code to work with current versions, and incorporated some of the suggestions above. Hopefully it’s useful to someone else who finds themselves here at the end of a Google search.
I’m using pm.Data instead of theano.shared; I couldn’t get the latter to work.
import arviz as az
import logging
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli

# silence PyMC3's sampling chatter
logger = logging.getLogger('pymc3')
logger.setLevel(logging.ERROR)

p_true = 0.60  # hidden success rate

with pm.Model() as model:
    # initialized with bootstrap values
    n = pm.Data('n', 100)
    data = pm.Data('data', bernoulli.rvs(p_true, size=100).sum())
    p = pm.Beta('p', alpha=2, beta=2)
    y_obs = pm.Binomial('y_obs', p=p, n=n, observed=data)

# Bootstrap initial p_hat
with model:
    trace = pm.sample(10000, progressbar=False, return_inferencedata=True)

interval = az.hdi(trace, hdi_prob=0.95)  # note: az.hdi takes hdi_prob, not alpha
interval_width = float(interval['p'][1] - interval['p'][0])
summary = az.summary(trace)
p_hat = summary["mean"]["p"]

sample_size = [10, 100, 1000, 10000]
observed_data = [bernoulli.rvs(p_hat, size=s).sum() for s in sample_size]

# Iterate over sample sizes, re-using the same model via pm.set_data
interval_widths = list()
for i in range(len(sample_size)):
    with model:
        pm.set_data({'data': observed_data[i], 'n': sample_size[i]})
        trace = pm.sample(progressbar=False, return_inferencedata=True)
    interval = az.hdi(trace, hdi_prob=0.95)
    interval_width = interval['p'][1] - interval['p'][0]
    interval_widths.append(float(interval_width))

fig = plt.figure()
plt.plot(np.asarray(sample_size), np.asarray(interval_widths))
plt.yscale("log")
plt.title("Credible Interval Precision Curve")
plt.ylabel("HDI Width")
plt.xlabel("Sample Size")
plt.show()
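If the aim is the smallest n whose 95% interval width is at most 0.2 (the original question), one possible follow-up is to scan a denser grid of sample sizes and average several simulated widths per n to smooth out the Monte Carlo noise. A sketch using the conjugate update instead of MCMC, re-using the Beta(2, 2) prior and a plugged-in p_hat (both just the values from above):

import numpy as np
from scipy.stats import beta

p_hat = 0.60        # plug in the bootstrap estimate from above
target_width = 0.2
a_prior, b_prior = 2, 2
rng = np.random.default_rng(0)

def avg_width(n, reps=50):
    # average equal-tailed 95% interval width over simulated datasets of size n
    widths = []
    for _ in range(reps):
        s = rng.binomial(n, p_hat)  # simulated number of successes in n trials
        a, b = a_prior + s, b_prior + n - s
        widths.append(beta.ppf(0.975, a, b) - beta.ppf(0.025, a, b))
    return np.mean(widths)

for n in range(10, 1000, 10):
    if avg_width(n) <= target_width:
        print(f"smallest n on this grid with average width <= {target_width}: {n}")
        break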