Implementation of Multivariate Hypergeometric distribution not sampling correctly

Once again, thanks for the pointers. I checked possible implementation mistakes which you pointed out in 1. and 2.

  1. An error in the logp

So I wrote the logp function for 1000 random values of K and k in both Excel and Python (neither have native implementations of the multivariate version of the hypergeometric distributions), based on this formulation in Wikipedia. I also ran the same values through the implementation that I used in the logp() of the MvHypergeometric distribution. All three matched exactly. So the logp() function seems fine.

  1. An error in random, most of these come from bad shape handling, but as you also wrote random_, then that could be the source of errors.

I did something similar to the above for the random sampling function. I took 105 random draws from MvHypergeometric(K=[35, 43], n=20) (which is equivalent to a bivariate Hypergeometric with ngood = 35 and nbad = 43 since there are only two categories), and compared it to the numpy random generator (numpy.random.hypergeometric) and to another function that implements the hypergeometric RNG in another way. I was then able to generate this graph, comparing the (cumulative) histogram for the three ways of generating hypergeometric-distributed samples:

So it seems the random() function is also working fine. Here is the code to generate the plot:

def compare_random(size_stat=10 ** 4, n_sample=20, n_cats=2, K_max=50):
    K_max = 50

    Ki = range(n_cats)
    K = np.random.randint(low=1, high=K_max + 1, size=n_cats)
    n_actual = min(n_sample, K.sum())

    if n_actual != n_sample:
        print('Actual sample size is {0:n}'.format(n_actual))

    df_cols = ['n'] + ['k_{0:n}'.format(idx + 1) for idx in range(K.shape[-1])]

    df_k_python = pd.DataFrame([], columns=df_cols, index=range(size_stat))
    df_k_np = pd.DataFrame([], columns=df_cols, index=range(size_stat))
    df_k_pymc3 = pd.DataFrame([], columns=df_cols, index=range(size_stat))

    dfs = { 'python': df_k_python,
            'numpy': df_k_np,
            'pymc3': df_k_pymc3
            }

    for s in tqdm(range(size_stat), 'Sample'):
        K_it = K.copy()
        k = np.zeros_like(K)
        for i in range(n_actual):
            rand_idx = np.random.choice(Ki, p=K_it / K_it.sum())
            K_it[rand_idx] -= 1
            k[rand_idx] += 1

        df_k_python.loc[s] = [n_actual] + list(k)

    df_k_np['n'] = df_k_python['n']
    df_k_np['k_1'] = np.random.hypergeometric(ngood=K[0], nbad=K.sum() - K[0], nsample=df_k_python['n'])
    df_k_np['k_2'] = df_k_np['n'] - df_k_np['k_1']

    df_k_pymc3['n'] = df_k_python['n']
    with pm.Model() as ex4:
        hyp = MvHypergeometric('hyp', K=K, n=n_actual, shape=K.shape)
    df_k_pymc3[['k_1', 'k_2']] = hyp.random(size=size_stat)

    # plot
    fig = plt.figure(figsize=(12, 8))
    ax1, ax2 = fig.subplots(nrows=2)

    for n, df in dfs.items():
        sns.distplot(a=df['k_1'], kde=False,
                     hist_kws={'cumulative': True, 'histtype': 'step', "linewidth": 3
                     }, ax=ax1, label=n)
        sns.distplot(a=df['k_2'], kde=False,
                     hist_kws={'cumulative': True, 'histtype': 'step', "linewidth": 3,
                     }, ax=ax2, label=n)

    labeldict = {1: 'good', 2: 'bad'}
    for ax in [ax1, ax2]:
        ax.legend()
        i = int(ax.get_xlabel().split('_')[-1])
        label = labeldict[i]
        ax.set_xlabel(r'$k_' + '{0:n}'.format(i) + r' = k_{' + label + r'}$')
        ax.set_ylabel('Frequency (cumulative)')

    ax1.figure.suptitle('{0:.2e} draws from MvHypergeom with two categories (equal to Hypergeom)'.format(size_stat))

    ax1.set_title(r' Distribution: $\vec{k}\sim MvHypergeometric(n_{good} = ' + \
                   '{0:n}'.format(K[0]) + r', n_{bad} = ' + '{0:n}'.format(K[1]) + r', n = ' + \
                   '{0:n}'.format(n_actual) + r')$')

    ax1.figure.tight_layout(rect=(0, 0, 1, 0.96))

    return [ax1, ax2], dfs

Next I’ll try implementing a simpler version of the MvHypergeometric class (just random() and logp(), simple shape handling) just to see if I can get it to work.

2 Likes