Multivariate Hypergeometric model has very low number of effective samples


I implemented a Multivariate Hypergeometric RV using PyMC3, but the posterior traces in a toy model have a low number of effective samples; I’m hoping y’all might give me a few pointers on how to improve the converge characteristics, either by refactoring the distribution implementation or by reparameterizing the toy model.

The distribution is implemented as follows:

from pymc3.distributions import Discrete, draw_values, generate_samples
from scipy.stats import hypergeom
import theano.tensor as tt
from pymc3.util import get_variable_name
from pymc3.distributions.dist_math import bound, binomln
from pymc3.math import tround
import pymc3 as pm
import numpy as np

class MvHypergeometric(Discrete):
    def __init__(self, K, n, *args, **kwargs):
        super(MvHypergeometric, self).__init__(*args, **kwargs)

        N = tt.sum(K, axis=-1, keepdims=True)
        n = np.squeeze(n) # works also if n is a

        self.N = N
        if len(self.shape) > 1:
            m = self.shape[-2]
            # try:
            #     assert n.shape == (m,)
            # except (AttributeError, AssertionError):
            #     n = n * tt.ones(m)
            self.n = tt.shape_padright(n)
            self.K = K if K.ndim > 1 else tt.shape_padleft(K)
        elif n.ndim == 1:
            self.n = tt.shape_padright(n)
            self.K = K if K.ndim > 1 else tt.shape_padleft(K)
            # n is a scalar, K is a 1d array
            self.n = tt.as_tensor_variable(n)
            self.K = tt.as_tensor_variable(K)

        self.mean = self.n * self.K / self.N
        mode = tt.cast(self.mean, 'int32')
        diff = self.n - tt.sum(mode, axis=-1, keepdims=True)
        inc_bool_arr = tt.abs_(diff) > 0
        mode = tt.inc_subtensor(mode[inc_bool_arr.nonzero()],
        self.mode = mode

    def random(self, point=None, size=None):
        n, K = draw_values([self.n, self.K], point=point, size=size)
        samples = random_mv_hypergeom(K=K, n=n, size=size)
        return samples

    def logp(self, x):
        n = self.n
        K = self.K
        N = self.N

        return bound(
            tt.sum(binomln(K, x)) - binomln(N,tt.sum(x, axis=-1, keepdims=True)),
            tt.all(x >= 0),
            tt.all(tt.eq(tt.sum(x, axis=-1, keepdims=True), n)),
            tt.all(K >= 0),
            tt.all(tt.eq(tt.sum(K), N)),
            tt.all(tt.le(n, N)),

    def _repr_latex_(self, name=None, dist=None):
        if dist is None:
            dist = self
        n = dist.n
        K = dist.K
        return r'${} \sim \text{{MVHypergeometric}}*(\mathit{{K}}={}, \mathit{{n}}={})$'.format(

def random_mv_hypergeom(K, n, size=1):
    K : one-dimensional array containing the total amount of each category in the population
    n : sample size

    One-dimensional array with the same length as `K` containing a random sample of the

    if size is None:
        size = 1

    if not isinstance(size, int):
        size = size[0]

    n = np.repeat(a=n, repeats=size)

    remaining = np.cumsum(K[::-1])[::-1]
    result = np.zeros((len(K), size),
    for i, k in enumerate(K[:-1]):
        n_true = n[:]
        ignore_results = n<1
        n[n<1] = 0
        result[i] = np.random.hypergeometric(k, remaining[i + 1], n)
        n -= result[i]
        result[i, ignore_results] = 0
        n = n_true[:]

    result[-1] = n
    return result.transpose()

I am currently using it to implement the ordered-probit model from Kruschke’s DBDA book (chapter 23). Instead of using a Multinomial RV to describe each ordinal category count, I’d like to use a MvHypergeometric instead, since in my particular application the sample size is large compared to the population size.

The toy model I used to develop the MvHypergeometric RV tries to estimate the size of each category K in the population (population size N = 180) based on a sample of n = 162 members of this population. A good example of this is estimating the number of green, black, and red balls in a urn with 180 balls by examining 162 of them.

I model the size of each category in the sample, k, as a count (hyp) of each of three categories, where hyp \sim \textrm{MvH}(N = 180, K = \vec{KK}, n = 162). hyp is observed to have a random sample from the MvHypergeometric distribution with the true values of K (what we are trying to estimate) of trueK = [40,60,80]. One possible value of k would be, say, k = [35, 54, 73].

KK is itself a Multinomial RV: KK \sim \textrm{Multinomial}(N = 180, p = \vec{pp}). Since KK must sum up to N, I set N = 180, but we don’t know a priori how the categories are distributed, so we have to model the p parameter.

pp is modeled as a Dirichlet RV: pp \sim \textrm{Dirichlet}(a = [1/3, 1/3, 1/3]). To help with convergence, I set testval to the observed probability of each category (testval = obs/n).

from pymc3.distributions import Discrete, draw_values, generate_samples
from scipy.stats import hypergeom
import theano.tensor as tt
from pymc3.util import get_variable_name
from pymc3.distributions.dist_math import bound, binomln
from pymc3.math import tround
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.special as spsp
import pandas as pd
from tqdm import tqdm


if __name__ == '__main__':

    trueK = np.asarray([40, 60, 80])
    N = trueK.sum()

    fn = 0.9
    n = (N*fn).astype('int32')
    obs = random_mv_hypergeom(K=trueK, n=n)[0]
    p = obs/n

    diffusiveness = np.int32(1)

    with pm.Model() as ex:
        pp = pm.Dirichlet('p', a = (1/diffusiveness)*N*np.repeat(1/len(trueK), repeats=len(trueK)), shape=trueK.shape,
        KK = pm.Multinomial('K', n=N, p=pp, shape=trueK.shape)
        hyp = MvHypergeometric(name='k', K=KK, n=n,
        tex1 = pm.sample(draws=10000)

When I run this model, I get some warnings about the effective sample size:

Multiprocess sampling (4 chains in 4 jobs)
>NUTS: [p]
>Metropolis: [K]
Sampling 4 chains: 100%|██████████| 42000/42000 [00:20<00:00, 2011.57draws/s]
The number of effective samples is smaller than 10% for some parameters.

Looking at the trace’s pm.summary(), the ESS’s are indeed very small, but only for the Ks, not for the ps:

In[5]: pm.summary(tex1)
           mean        sd  mc_error  ...   hpd_97.5        n_eff      Rhat
K__0  44.149725  1.940504  0.113412  ...  47.000000   242.410104  1.008109
K__1  60.221650  1.999105  0.108159  ...  64.000000   275.596334  1.013769
K__2  75.628625  2.146522  0.124841  ...  79.000000   239.609046  1.021091
p__0   0.289355  0.024403  0.000324  ...   0.337339  6544.468395  1.000423
p__1   0.333925  0.025245  0.000306  ...   0.384017  7342.134394  1.000459
p__2   0.376720  0.026256  0.000364  ...   0.427981  5043.817737  1.000888

The traces are also strange: while the Ks seemed to mix poorly (there seems to be a lot of autocorrelation in the K traces), the true values of [40, 60, 80] are covered by the respective HPDs. However, the ps seemed to mix a lot better:

This seems strange, since the model goes pp \rightarrow KK \rightarrow hyp; I’d think that, since hyp is observed, if KK sampled poorly, so would pp. That is clearly not the case, though.

The ps also aren’t compatible with the Ks. For example, in the above trace, K_1's 95% HPD is [57, 64], which would correspond to a [0.316667, 0.355556] 95% HPD on the proportion p_1. However, p_1's actual 95% HPD is [0.285459, 0.384017], which is much wider.

This gets worse the closer the sample size is to the population size (the ration between the sample size and the population size is called fn in the code). Running the same model with fn = 0.99, it gives K_1's 95% HPD as [59, 61], which would correspond to a [0.327778, 0.338889] 95% HPD on the proportion p_1. p_1's actual 95% HPD is [0.285132, 0.382066], which is even wider in comparison. Note that, even though the mixing is terrible (as indicated by the number of effective samples as well as the appearance of the trace), the posterior distribution manages to cover the true value (in this case, trueK[1] == 60).

KK also gets poorer and poorer mixing as the sample size is smaller in proportion to the population size. In theory, as fn gets smaller, the MvHypergeometric should converge to a Multinomial with equal sample size (n_multinomial = n_mvh) and p_multinomial = K_mvh / N_mvh.

Finally, in all of the tests with different values of fn and trueK, even the obviously very poorly mixed posteriors seemed to cover the true values.

I’ve done some tests (in another post in the PyMC3 Discourse) comparing the logp() of this implementation of the MvHypergeometric with other implementations, with positive results. I also compared random()'s output with others in the same post, also with positive results.

I’ve now spent the better part of a week thinking about this and messing with the code, but I wasn’t able to get anywhere further. Do y’all have any insights to share?

I think that the most likely cause of the low ess is the metropolis step. Metropolis steps or Gibbs are the only ways to sample from discrete variables such as K, but the problem is that it mixes badly. It’s even worse when the number of dimensions (in your case, the number of classes) increases. This means that the autocorrelation in the trace for K will be long. This can be interpreted to mean that given an element in the trace, the next step will have not reached the stable distribution of the Markov Chain yet. A typical way to deal with this problem is to thin the chain after sampling, in order to make each element in the thinned chain have little to no autocorrelation with the previous one. The downside is that you will have to draw much more samples than the ones you will end up using. In your case maybe thinning by a factor of 10 could work, but you should have a look at the autocorrelation plots.

The only alternative I can think of, to avoid metropolis, is to try to marginalize out K. This means that you should write down the probability distribution of your MvHypergeometric conditional on K and then sum over all possible combinations of K. This will lead to a Mixture of MvHypergeometrics with different determined K's, so you would not be inferring discrete values anymore, you would be left only with the continuous p and could sample using NUTS. However, mixtures are also very hard to sample from, so I would go with chain thinning first, and see how it goes.

1 Like

Thanks for the insight. I’ll try thinning first, as you suggested.

As for marginalizing out K: do you think this is a problem with this particular model, or the implementation of MvHypergeometric in general? In other words, if thinning is unfeasible (due to the number of samples required), should I try to implement the distribution in such a way that it automatically marginalizes out K?

I’m asking this because I plan on submitting this distribution (as well as the bivariate Hypergeometric) as a PR on GitHub. If this issue with the high autocorrelation might show up on any model that uses this distribution, maybe it’s worth it to implement the distribution differently, yes?

I tried thinning the trace, and it apparently worked:

Multiprocess sampling (4 chains in 4 jobs)
>NUTS: [p]
>Metropolis: [K]
Sampling 4 chains: 100%|██████████| 402000/402000 [02:25<00:00, 2767.98draws/s]
The number of effective samples is smaller than 10% for some parameters.

In[3]: pm.summary(tex1)
           mean        sd  mc_error  ...   hpd_97.5         n_eff      Rhat
K__0  38.920828  1.916841  0.039850  ...  42.000000   2504.700201  1.000392
K__1  60.998500  2.055525  0.046910  ...  65.000000   1974.087224  1.000406
K__2  80.080673  2.105018  0.051597  ...  84.000000   2017.003247  1.000479
p__0   0.274843  0.024075  0.000114  ...   0.321911  49930.290211  1.000024
p__1   0.336110  0.025563  0.000137  ...   0.386465  41332.624531  1.000001
p__2   0.389047  0.026263  0.000147  ...   0.439983  43258.001858  1.000029

The trace for fn = 0.9 look OK, and the number of effective samples increase to an acceptable value. It took about 7 times longer to compile because there were 10 times more samples in the trace, but for the toy model it’s still acceptable. I’ll try it with a more realistic model next.

I’ll mark you first answer as the solution, and I’ll post the results with the realistic model when I get to it. Thanks a lot!


Thanks so much for intending to send this as a PR!

I don’t think that your distribution’s parametrization is wrong because it has discrete random variables, that’s just the way it is. Sampling discrete RVs has the difficulty of not being tractable with HMC because there are no gradients, so all discrete distributions suffer from the downsides of the Metropolis step method. With luck, in near future a novel sampler that mixes well discrete RVs will eventually come out, and will make everything better. There are some papers on the topic.

1 Like