Error while sampling from Binomial with large n

I am facing the following issue with v4.2.2 (I also tried v5.0, but the same issue persists):

Sampling from a Binomial with a Beta prior proceeds without issues as long as the number of trials is below 2.2bn, but for n >= 2.2bn the initial evaluation checks fail.

It also fails for 2.3bn and 3.0bn, but works well for smaller numbers.

Sharing the code and stack trace of a minimal example to reproduce:

Code:

import numpy as np
import pymc as pm

for n in [2_100_000_000, 2_200_000_000]:
    n = int(n)
    exclusive_count = np.random.binomial(n=n, p=0.25, size=1)
    print(f'Sampling with n = {n}, exclusive_count = {exclusive_count}')
    with pm.Model() as model:
        p_exclusive_id = pm.Beta('p_exclusive_id', alpha=1.0, beta=1.0)
        n_exclusive_id = pm.Binomial('n_exclusive_id',
                                     n=n,
                                     p=p_exclusive_id,
                                     observed=exclusive_count)
        idata = pm.sample()

Stack Trace of the error:

---------------------------------------------------------------------------
SamplingError                             Traceback (most recent call last)
<ipython-input-33-b527c150c47a> in <module>
     13                                      p = p_exclusive_id,
     14                                      observed = exclusive_count) 
---> 15         idata = pm.sample()

~/PycharmProjects/cross-system-reporting/venv/lib/python3.8/site-packages/pymc/sampling.py in sample(draws, step, init, n_init, initvals, trace, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, **kwargs)
    559     # One final check that shapes and logps at the starting points are okay.
    560     for ip in initial_points:
--> 561         model.check_start_vals(ip)
    562         _check_start_shape(model, ip)
    563 

~/PycharmProjects/cross-system-reporting/venv/lib/python3.8/site-packages/pymc/model.py in check_start_vals(self, start)
   1799 
   1800             if not all(np.isfinite(v) for v in initial_eval.values()):
-> 1801                 raise SamplingError(
   1802                     "Initial evaluation of model at starting point failed!\n"
   1803                     f"Starting values:\n{elem}\n\n"

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'p_exclusive_id_logodds__': array(0.22135405)}

Initial evaluation results:
{'p_exclusive_id': -1.4, 'n_exclusive_id': -inf}

With a large enough number you will eventually fall out of the numerical support of the logp expression of any distribution and start seeing underflow/overflow issues.
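
If you want to see which term is hitting -inf without running the full sampler, you can evaluate the logp of each variable at the initial point yourself. A minimal sketch, assuming the model from the snippet above and the PyMC v4+ Model.point_logps API:

# Evaluates the logp of every model variable at the initial point;
# for n >= 2.2bn the 'n_exclusive_id' entry should come back as -inf,
# matching the "Initial evaluation results" shown in the error message.
print(model.point_logps())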

Thanks @ricardoV94. Any suggestions on how I should handle this? The use case is modeling binomial outcomes for a large number of programmatic ad impressions, so the numbers really are at this scale.

At those scales you can either approximate the binomial with a normal (in which case you can also rescale the data) or rescale your binomial data directly, without much loss of accuracy: Binomial distribution - Wikipedia
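
For reference, a minimal sketch of the normal-approximation route (my own illustration, not code from the thread; it reuses the variable names from the snippet above and assumes n*p and n*(1-p) are both large):

import numpy as np
import pymc as pm

n = 2_200_000_000
exclusive_count = np.random.binomial(n=n, p=0.25, size=1).astype('float64')

with pm.Model() as model:
    p_exclusive_id = pm.Beta('p_exclusive_id', alpha=1.0, beta=1.0)
    # Binomial(n, p) is approximately Normal(mu=n*p, sigma=sqrt(n*p*(1-p))) for large n,
    # so the huge count can be treated as a continuous observation.
    mu = float(n) * p_exclusive_id
    sigma = pm.math.sqrt(float(n) * p_exclusive_id * (1.0 - p_exclusive_id))
    pm.Normal('n_exclusive_id', mu=mu, sigma=sigma, observed=exclusive_count)
    idata = pm.sample()

The other option mentioned above is to rescale the binomial data (for example, dividing both the observed count and n by a common factor) before building the likelihood.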


Thanks @ricardoV94, that seems to work!