Implementation of Multivariate Hypergeometric distribution not sampling correctly

lmnice · April 14, 2019, 3:11am

Once again, thanks for the pointers. I checked possible implementation mistakes which you pointed out in 1. and 2.

An error in the logp

So I wrote the logp function for 1000 random values of K and k in both Excel and Python (neither have native implementations of the multivariate version of the hypergeometric distributions), based on this formulation in Wikipedia. I also ran the same values through the implementation that I used in the logp() of the MvHypergeometric distribution. All three matched exactly. So the logp() function seems fine.

An error in random, most of these come from bad shape handling, but as you also wrote random_, then that could be the source of errors.

I did something similar to the above for the random sampling function. I took 10⁵ random draws from MvHypergeometric(K=[35, 43], n=20) (which is equivalent to a bivariate Hypergeometric with ngood = 35 and nbad = 43 since there are only two categories), and compared it to the numpy random generator (numpy.random.hypergeometric) and to another function that implements the hypergeometric RNG in another way. I was then able to generate this graph, comparing the (cumulative) histogram for the three ways of generating hypergeometric-distributed samples:

So it seems the random() function is also working fine. Here is the code to generate the plot:

def compare_random(size_stat=10 ** 4, n_sample=20, n_cats=2, K_max=50):
    K_max = 50

    Ki = range(n_cats)
    K = np.random.randint(low=1, high=K_max + 1, size=n_cats)
    n_actual = min(n_sample, K.sum())

    if n_actual != n_sample:
        print('Actual sample size is {0:n}'.format(n_actual))

    df_cols = ['n'] + ['k_{0:n}'.format(idx + 1) for idx in range(K.shape[-1])]

    df_k_python = pd.DataFrame([], columns=df_cols, index=range(size_stat))
    df_k_np = pd.DataFrame([], columns=df_cols, index=range(size_stat))
    df_k_pymc3 = pd.DataFrame([], columns=df_cols, index=range(size_stat))

    dfs = { 'python': df_k_python,
            'numpy': df_k_np,
            'pymc3': df_k_pymc3
            }

    for s in tqdm(range(size_stat), 'Sample'):
        K_it = K.copy()
        k = np.zeros_like(K)
        for i in range(n_actual):
            rand_idx = np.random.choice(Ki, p=K_it / K_it.sum())
            K_it[rand_idx] -= 1
            k[rand_idx] += 1

        df_k_python.loc[s] = [n_actual] + list(k)

    df_k_np['n'] = df_k_python['n']
    df_k_np['k_1'] = np.random.hypergeometric(ngood=K[0], nbad=K.sum() - K[0], nsample=df_k_python['n'])
    df_k_np['k_2'] = df_k_np['n'] - df_k_np['k_1']

    df_k_pymc3['n'] = df_k_python['n']
    with pm.Model() as ex4:
        hyp = MvHypergeometric('hyp', K=K, n=n_actual, shape=K.shape)
    df_k_pymc3[['k_1', 'k_2']] = hyp.random(size=size_stat)

    # plot
    fig = plt.figure(figsize=(12, 8))
    ax1, ax2 = fig.subplots(nrows=2)

    for n, df in dfs.items():
        sns.distplot(a=df['k_1'], kde=False,
                     hist_kws={'cumulative': True, 'histtype': 'step', "linewidth": 3
                     }, ax=ax1, label=n)
        sns.distplot(a=df['k_2'], kde=False,
                     hist_kws={'cumulative': True, 'histtype': 'step', "linewidth": 3,
                     }, ax=ax2, label=n)

    labeldict = {1: 'good', 2: 'bad'}
    for ax in [ax1, ax2]:
        ax.legend()
        i = int(ax.get_xlabel().split('_')[-1])
        label = labeldict[i]
        ax.set_xlabel(r'$k_' + '{0:n}'.format(i) + r' = k_{' + label + r'}$')
        ax.set_ylabel('Frequency (cumulative)')

    ax1.figure.suptitle('{0:.2e} draws from MvHypergeom with two categories (equal to Hypergeom)'.format(size_stat))

    ax1.set_title(r' Distribution: $\vec{k}\sim MvHypergeometric(n_{good} = ' + \
                   '{0:n}'.format(K[0]) + r', n_{bad} = ' + '{0:n}'.format(K[1]) + r', n = ' + \
                   '{0:n}'.format(n_actual) + r')$')

    ax1.figure.tight_layout(rect=(0, 0, 1, 0.96))

    return [ax1, ax2], dfs

Next I’ll try implementing a simpler version of the MvHypergeometric class (just random() and logp(), simple shape handling) just to see if I can get it to work.

Topic		Replies	Views
Multivariate Hypergeometric model has very low number of effective samples Questions	4	1208	April 22, 2019
Problem with pymc3.Multinomial() distribution? Questions	3	565	July 14, 2020
Multivariate Normal code does not work since version 3.2 Development	4	891	October 30, 2017
Asking help for Multivariate Poisson distribution implement	0	272	May 18, 2023
Mixed multivariate Gauss distribution Questions	39	4812	April 12, 2018

Implementation of Multivariate Hypergeometric distribution not sampling correctly

Related topics