Specifying hierarchical Binomial model when individuals have varying numbers of trials?

I have a quick question regarding how to properly specify a hierarchical Binomial model, and I feel it is rather basic, but I have not found any relevant info anywhere on this, so I was hoping someone will enlighten me here, which I would appreciate greatly.

When estimating a Binomial model using pm.Binomial(), the option n tells us how many trials, of success or failure, that we have data for, for a single individual. This allows one to calculate the probability p, for example, of a coin landing heads, after we have flipped it n times, for the single, solitary coin.

However, in a hierarchical Binomial model, we could imagine instead having a whole bag of faulty coins (from the same disreputable source), and we wish to estimate their individual p‘s, with a hyperprior allowing each coin’s bias to affect the estimates for the other coins’ biases.

Based on Wiecki’s blogpost here, all this would seem simple, but the n parameter is tripping me up - each individual has a different number of trials, so what should be put down?

My code so far:

import numpy as np
import copy
import pymc3 as pm

## Data generation
coin_data = [np.random.randint(0,2, size=np.random.randint(20,42)) for i in range(5)]

print(coin_data)

coin_data_index = copy.deepcopy(coin_data)
for index, mini_array in enumerate(coin_data_index):
    for i in np.nditer(mini_array, op_flags=['readwrite']):
        mini_array[...] = index

coin_data = np.hstack(coin_data)
coin_data_index = np.hstack(coin_data_index)

with pm.Model() as bag_of_coins_model:
    
    ## Each coin has its own intercept 'a', and 'b' is the global intercept
    length = len( np.unique(coin_data_index) )
    a = pm.Normal('a', 0, 3, shape = length)
    b = pm.Normal('b', 0, 3)
    
    p = pm.math.invlogit(a[coin_data_index] + b)
    
    flipped_heads = pm.Binomial('flipped_heads', n=1, p=p, observed = coin_data)

    trace = pm.sample(2000, tune=2000)

If I set n=length, I get completely different results, so I don’t know which to use. I imagine n should be equal to the number of times each coin is flipped/trials for each individual, but I do not know if that is what Pymc3 is indeed doing internally.

Would someone explain how and why to set n in this example?

n should be a vector with the same length as coin_data, something like [10, 9, 5, 6, …] that each element indicates the total n of that specific observed trial

So, I created an array, coin_range_lens, which is an array [30, 31, 33, 19, 40] with the lengths of trials for each of the 5 coins. This was my code,

import itertools
len_of_each_coin = [np.argwhere(coin_data_index == i).T[0] for i in range(5)]

def to_ranges(iterable):

    for mini_list in iterable:
        if type(mini_list) == np.ndarray:
            mini_list = mini_list.tolist()
        mini_list = sorted(set(mini_list))
        for key, group in itertools.groupby(enumerate(mini_list),
                                            lambda t: t[1] - t[0]):
            group = list(group)
            yield group[0][1], group[-1][1]

coin_ranges = list(to_ranges(len_of_each_coin))
coin_range_lens = [len(range(i[0], i[1])) for i in coin_ranges]

, but using coin_range_lens as n returns a Input dimension mis-match error.

coin_data is for me a flat array with all the trials of all the coins. How could the lengths vector n be as long as coin_data itself, junpenglao?

array([1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 1, 1])

What I meant is something like:

In [10]: nvect, pvect = [], [] 
    ...: for i in np.unique(coin_data_index): 
    ...:     idx = coin_data_index == i 
    ...:     nvect.append(idx.sum()) 
    ...:     pvect.append(coin_data[idx].sum()) 
    ...:                                                                        

In [11]: with pm.Model() as bag_of_coins_model: 
    ...:      
    ...:     ## Each coin has its own intercept 'a', and 'b' is the global inter
    ...: cept 
    ...:     length = len( np.unique(coin_data_index) ) 
    ...:     a = pm.Normal('a', 0, 3, shape = length) 
    ...:     b = pm.Normal('b', 0, 3) 
    ...:      
    ...:     p = pm.math.invlogit(a + b) 
    ...:      
    ...:     flipped_heads = pm.Binomial('flipped_heads', n=nvect, p=p, observed
    ...:  = pvect) 
    ...:  
    ...:     trace = pm.sample(2000, tune=2000) 

But actually what you are doing should be fine with n=1, the different number of trail is accounted for by the accumulation in the Bernoulli trial.

I see what you are doing now. But just to confirm it, you are doing the following:

  1. setting n to be number of trials / individual,
  2. setting observed to be the number of successes / individual,
  3. and removing the brackets, [], around a because the shape and other arguments make sure to not make them universal variables.

But what I did before also does the same thing, despite the syntax having a few key differences?

You are right. And yes the two model is the same.

Thanks a lot, junpenglao! This will definitely clear up a few issues that I will undoubtedly encounter along the road when dealing with more complex models!

1 Like