# Specifying hierarchical Binomial model when individuals have varying numbers of trials?

I have a quick question regarding how to properly specify a hierarchical Binomial model, and I feel it is rather basic, but I have not found any relevant info anywhere on this, so I was hoping someone will enlighten me here, which I would appreciate greatly.

When estimating a Binomial model using `pm.Binomial()`, the option `n` tells us how many trials, of success or failure, that we have data for, for a single individual. This allows one to calculate the probability `p`, for example, of a coin landing heads, after we have flipped it `n` times, for the single, solitary coin.

However, in a hierarchical Binomial model, we could imagine instead having a whole bag of faulty coins (from the same disreputable source), and we wish to estimate their individual `p`‘s, with a hyperprior allowing each coin’s bias to affect the estimates for the other coins’ biases.

Based on Wiecki’s blogpost here, all this would seem simple, but the `n` parameter is tripping me up - each individual has a different number of trials, so what should be put down?

My code so far:

``````import numpy as np
import copy
import pymc3 as pm

## Data generation
coin_data = [np.random.randint(0,2, size=np.random.randint(20,42)) for i in range(5)]

print(coin_data)

coin_data_index = copy.deepcopy(coin_data)
for index, mini_array in enumerate(coin_data_index):
mini_array[...] = index

coin_data = np.hstack(coin_data)
coin_data_index = np.hstack(coin_data_index)

with pm.Model() as bag_of_coins_model:

## Each coin has its own intercept 'a', and 'b' is the global intercept
length = len( np.unique(coin_data_index) )
a = pm.Normal('a', 0, 3, shape = length)
b = pm.Normal('b', 0, 3)

p = pm.math.invlogit(a[coin_data_index] + b)

trace = pm.sample(2000, tune=2000)
``````

If I set `n=length`, I get completely different results, so I don’t know which to use. I imagine `n` should be equal to the number of times each coin is flipped/trials for each individual, but I do not know if that is what Pymc3 is indeed doing internally.

Would someone explain how and why to set `n` in this example?

n should be a vector with the same length as `coin_data`, something like [10, 9, 5, 6, …] that each element indicates the total n of that specific observed trial

So, I created an array, `coin_range_lens`, which is an array `[30, 31, 33, 19, 40]` with the lengths of trials for each of the 5 coins. This was my code,

``````import itertools
len_of_each_coin = [np.argwhere(coin_data_index == i).T for i in range(5)]

def to_ranges(iterable):

for mini_list in iterable:
if type(mini_list) == np.ndarray:
mini_list = mini_list.tolist()
mini_list = sorted(set(mini_list))
for key, group in itertools.groupby(enumerate(mini_list),
lambda t: t - t):
group = list(group)
yield group, group[-1]

coin_ranges = list(to_ranges(len_of_each_coin))
coin_range_lens = [len(range(i, i)) for i in coin_ranges]
``````

, but using `coin_range_lens` as `n` returns a `Input dimension mis-match` error.

`coin_data` is for me a flat array with all the trials of all the coins. How could the lengths vector `n` be as long as `coin_data` itself, junpenglao?

``````array([1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0,
0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1,
1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0,
0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
0, 1, 1, 1])``````

What I meant is something like:

``````In : nvect, pvect = [], []
...: for i in np.unique(coin_data_index):
...:     idx = coin_data_index == i
...:     nvect.append(idx.sum())
...:     pvect.append(coin_data[idx].sum())
...:

In : with pm.Model() as bag_of_coins_model:
...:
...:     ## Each coin has its own intercept 'a', and 'b' is the global inter
...: cept
...:     length = len( np.unique(coin_data_index) )
...:     a = pm.Normal('a', 0, 3, shape = length)
...:     b = pm.Normal('b', 0, 3)
...:
...:     p = pm.math.invlogit(a + b)
...:
...:  = pvect)
...:
...:     trace = pm.sample(2000, tune=2000)

``````

But actually what you are doing should be fine with `n=1`, the different number of trail is accounted for by the accumulation in the Bernoulli trial.

I see what you are doing now. But just to confirm it, you are doing the following:

1. setting `n` to be number of trials / individual,
2. setting `observed` to be the number of successes / individual,
3. and removing the brackets, `[]`, around `a` because the `shape` and other arguments make sure to not make them universal variables.

But what I did before also does the same thing, despite the syntax having a few key differences?

You are right. And yes the two model is the same.

Thanks a lot, junpenglao! This will definitely clear up a few issues that I will undoubtedly encounter along the road when dealing with more complex models!

1 Like