Once again, thanks for the pointers. I checked possible implementation mistakes which you pointed out in 1. and 2.
- An error in the logp
So I wrote the logp function for 1000 random values of K
and k
in both Excel and Python (neither have native implementations of the multivariate version of the hypergeometric distributions), based on this formulation in Wikipedia. I also ran the same values through the implementation that I used in the logp()
of the MvHypergeometric
distribution. All three matched exactly. So the logp()
function seems fine.
- An error in random, most of these come from bad shape handling, but as you also wrote random_, then that could be the source of errors.
I did something similar to the above for the random sampling function. I took 105 random draws from MvHypergeometric(K=[35, 43], n=20)
(which is equivalent to a bivariate Hypergeometric with ngood = 35
and nbad = 43
since there are only two categories), and compared it to the numpy random generator (numpy.random.hypergeometric
) and to another function that implements the hypergeometric RNG in another way. I was then able to generate this graph, comparing the (cumulative) histogram for the three ways of generating hypergeometric-distributed samples:
So it seems the random()
function is also working fine. Here is the code to generate the plot:
def compare_random(size_stat=10 ** 4, n_sample=20, n_cats=2, K_max=50):
K_max = 50
Ki = range(n_cats)
K = np.random.randint(low=1, high=K_max + 1, size=n_cats)
n_actual = min(n_sample, K.sum())
if n_actual != n_sample:
print('Actual sample size is {0:n}'.format(n_actual))
df_cols = ['n'] + ['k_{0:n}'.format(idx + 1) for idx in range(K.shape[-1])]
df_k_python = pd.DataFrame([], columns=df_cols, index=range(size_stat))
df_k_np = pd.DataFrame([], columns=df_cols, index=range(size_stat))
df_k_pymc3 = pd.DataFrame([], columns=df_cols, index=range(size_stat))
dfs = { 'python': df_k_python,
'numpy': df_k_np,
'pymc3': df_k_pymc3
}
for s in tqdm(range(size_stat), 'Sample'):
K_it = K.copy()
k = np.zeros_like(K)
for i in range(n_actual):
rand_idx = np.random.choice(Ki, p=K_it / K_it.sum())
K_it[rand_idx] -= 1
k[rand_idx] += 1
df_k_python.loc[s] = [n_actual] + list(k)
df_k_np['n'] = df_k_python['n']
df_k_np['k_1'] = np.random.hypergeometric(ngood=K[0], nbad=K.sum() - K[0], nsample=df_k_python['n'])
df_k_np['k_2'] = df_k_np['n'] - df_k_np['k_1']
df_k_pymc3['n'] = df_k_python['n']
with pm.Model() as ex4:
hyp = MvHypergeometric('hyp', K=K, n=n_actual, shape=K.shape)
df_k_pymc3[['k_1', 'k_2']] = hyp.random(size=size_stat)
# plot
fig = plt.figure(figsize=(12, 8))
ax1, ax2 = fig.subplots(nrows=2)
for n, df in dfs.items():
sns.distplot(a=df['k_1'], kde=False,
hist_kws={'cumulative': True, 'histtype': 'step', "linewidth": 3
}, ax=ax1, label=n)
sns.distplot(a=df['k_2'], kde=False,
hist_kws={'cumulative': True, 'histtype': 'step', "linewidth": 3,
}, ax=ax2, label=n)
labeldict = {1: 'good', 2: 'bad'}
for ax in [ax1, ax2]:
ax.legend()
i = int(ax.get_xlabel().split('_')[-1])
label = labeldict[i]
ax.set_xlabel(r'$k_' + '{0:n}'.format(i) + r' = k_{' + label + r'}$')
ax.set_ylabel('Frequency (cumulative)')
ax1.figure.suptitle('{0:.2e} draws from MvHypergeom with two categories (equal to Hypergeom)'.format(size_stat))
ax1.set_title(r' Distribution: $\vec{k}\sim MvHypergeometric(n_{good} = ' + \
'{0:n}'.format(K[0]) + r', n_{bad} = ' + '{0:n}'.format(K[1]) + r', n = ' + \
'{0:n}'.format(n_actual) + r')$')
ax1.figure.tight_layout(rect=(0, 0, 1, 0.96))
return [ax1, ax2], dfs
Next I’ll try implementing a simpler version of the MvHypergeometric
class (just random()
and logp()
, simple shape handling) just to see if I can get it to work.