Hello,
I implemented a Multivariate Hypergeometric RV using PyMC3, but the posterior traces in a toy model have a low number of effective samples; I’m hoping y’all might give me a few pointers on how to improve the converge characteristics, either by refactoring the distribution implementation or by reparameterizing the toy model.
The distribution is implemented as follows:
from pymc3.distributions import Discrete, draw_values, generate_samples
from scipy.stats import hypergeom
import theano.tensor as tt
from pymc3.util import get_variable_name
from pymc3.distributions.dist_math import bound, binomln
from pymc3.math import tround
import pymc3 as pm
import numpy as np
class MvHypergeometric(Discrete):
def __init__(self, K, n, *args, **kwargs):
super(MvHypergeometric, self).__init__(*args, **kwargs)
N = tt.sum(K, axis=-1, keepdims=True)
n = np.squeeze(n) # works also if n is a
self.N = N
if len(self.shape) > 1:
m = self.shape[-2]
# try:
# assert n.shape == (m,)
# except (AttributeError, AssertionError):
# n = n * tt.ones(m)
self.n = tt.shape_padright(n)
self.K = K if K.ndim > 1 else tt.shape_padleft(K)
elif n.ndim == 1:
self.n = tt.shape_padright(n)
self.K = K if K.ndim > 1 else tt.shape_padleft(K)
else:
# n is a scalar, K is a 1d array
self.n = tt.as_tensor_variable(n)
self.K = tt.as_tensor_variable(K)
self.mean = self.n * self.K / self.N
mode = tt.cast(self.mean, 'int32')
diff = self.n - tt.sum(mode, axis=-1, keepdims=True)
inc_bool_arr = tt.abs_(diff) > 0
mode = tt.inc_subtensor(mode[inc_bool_arr.nonzero()],
diff[inc_bool_arr.nonzero()])
self.mode = mode
def random(self, point=None, size=None):
n, K = draw_values([self.n, self.K], point=point, size=size)
samples = random_mv_hypergeom(K=K, n=n, size=size)
return samples
def logp(self, x):
n = self.n
K = self.K
N = self.N
return bound(
tt.sum(binomln(K, x)) - binomln(N,tt.sum(x, axis=-1, keepdims=True)),
tt.all(x >= 0),
tt.all(tt.eq(tt.sum(x, axis=-1, keepdims=True), n)),
tt.all(K >= 0),
tt.all(tt.eq(tt.sum(K), N)),
tt.all(tt.le(n, N)),
broadcast_conditions=False
)
def _repr_latex_(self, name=None, dist=None):
if dist is None:
dist = self
n = dist.n
K = dist.K
return r'${} \sim \text{{MVHypergeometric}}*(\mathit{{K}}={}, \mathit{{n}}={})$'.format(
name,
get_variable_name(K),
get_variable_name(n))
def random_mv_hypergeom(K, n, size=1):
"""
Parameters
----------
K : one-dimensional array containing the total amount of each category in the population
n : sample size
Returns
-------
One-dimensional array with the same length as `K` containing a random sample of the
population
"""
if size is None:
size = 1
if not isinstance(size, int):
size = size[0]
n = np.repeat(a=n, repeats=size)
remaining = np.cumsum(K[::-1])[::-1]
result = np.zeros((len(K), size), dtype=np.int)
for i, k in enumerate(K[:-1]):
n_true = n[:]
ignore_results = n<1
n[n<1] = 0
result[i] = np.random.hypergeometric(k, remaining[i + 1], n)
n -= result[i]
result[i, ignore_results] = 0
n = n_true[:]
result[-1] = n
return result.transpose()
I am currently using it to implement the ordered-probit model from Kruschke’s DBDA book (chapter 23). Instead of using a Multinomial
RV to describe each ordinal category count, I’d like to use a MvHypergeometric
instead, since in my particular application the sample size is large compared to the population size.
The toy model I used to develop the MvHypergeometric
RV tries to estimate the size of each category K
in the population (population size N = 180
) based on a sample of n = 162
members of this population. A good example of this is estimating the number of green, black, and red balls in a urn with 180 balls by examining 162 of them.
I model the size of each category in the sample, k
, as a count (hyp
) of each of three categories, where hyp \sim \textrm{MvH}(N = 180, K = \vec{KK}, n = 162). hyp
is observed to have a random sample from the MvHypergeometric
distribution with the true values of K
(what we are trying to estimate) of trueK = [40,60,80]
. One possible value of k
would be, say, k = [35, 54, 73]
.
KK
is itself a Multinomial
RV: KK \sim \textrm{Multinomial}(N = 180, p = \vec{pp}). Since KK
must sum up to N
, I set N = 180
, but we don’t know a priori how the categories are distributed, so we have to model the p
parameter.
pp
is modeled as a Dirichlet
RV: pp \sim \textrm{Dirichlet}(a = [1/3, 1/3, 1/3]). To help with convergence, I set testval
to the observed probability of each category (testval = obs/n
).
from pymc3.distributions import Discrete, draw_values, generate_samples
from scipy.stats import hypergeom
import theano.tensor as tt
from pymc3.util import get_variable_name
from pymc3.distributions.dist_math import bound, binomln
from pymc3.math import tround
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.special as spsp
import pandas as pd
from tqdm import tqdm
sns.set()
if __name__ == '__main__':
trueK = np.asarray([40, 60, 80])
N = trueK.sum()
fn = 0.9
n = (N*fn).astype('int32')
obs = random_mv_hypergeom(K=trueK, n=n)[0]
p = obs/n
diffusiveness = np.int32(1)
with pm.Model() as ex:
pp = pm.Dirichlet('p', a = (1/diffusiveness)*N*np.repeat(1/len(trueK), repeats=len(trueK)), shape=trueK.shape,
testval=(1/diffusiveness.astype('int'))*obs)
KK = pm.Multinomial('K', n=N, p=pp, shape=trueK.shape)
hyp = MvHypergeometric(name='k', K=KK, n=n,
observed=obs)
tex1 = pm.sample(draws=10000)
When I run this model, I get some warnings about the effective sample size:
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>NUTS: [p]
>Metropolis: [K]
Sampling 4 chains: 100%|██████████| 42000/42000 [00:20<00:00, 2011.57draws/s]
The number of effective samples is smaller than 10% for some parameters.
Looking at the trace’s pm.summary()
, the ESS’s are indeed very small, but only for the K
s, not for the p
s:
In[5]: pm.summary(tex1)
Out[5]:
mean sd mc_error ... hpd_97.5 n_eff Rhat
K__0 44.149725 1.940504 0.113412 ... 47.000000 242.410104 1.008109
K__1 60.221650 1.999105 0.108159 ... 64.000000 275.596334 1.013769
K__2 75.628625 2.146522 0.124841 ... 79.000000 239.609046 1.021091
p__0 0.289355 0.024403 0.000324 ... 0.337339 6544.468395 1.000423
p__1 0.333925 0.025245 0.000306 ... 0.384017 7342.134394 1.000459
p__2 0.376720 0.026256 0.000364 ... 0.427981 5043.817737 1.000888
The traces are also strange: while the K
s seemed to mix poorly (there seems to be a lot of autocorrelation in the K
traces), the true values of [40, 60, 80]
are covered by the respective HPDs. However, the p
s seemed to mix a lot better:
This seems strange, since the model goes pp \rightarrow KK \rightarrow hyp; I’d think that, since hyp
is observed, if KK
sampled poorly, so would pp
. That is clearly not the case, though.
The p
s also aren’t compatible with the K
s. For example, in the above trace, K_1
's 95% HPD is [57, 64]
, which would correspond to a [0.316667, 0.355556]
95% HPD on the proportion p_1
. However, p_1
's actual 95% HPD is [0.285459, 0.384017]
, which is much wider.
This gets worse the closer the sample size is to the population size (the ration between the sample size and the population size is called fn
in the code). Running the same model with fn = 0.99
, it gives K_1
's 95% HPD as [59, 61]
, which would correspond to a [0.327778, 0.338889]
95% HPD on the proportion p_1
. p_1
's actual 95% HPD is [0.285132, 0.382066]
, which is even wider in comparison. Note that, even though the mixing is terrible (as indicated by the number of effective samples as well as the appearance of the trace), the posterior distribution manages to cover the true value (in this case, trueK[1] == 60
).
KK
also gets poorer and poorer mixing as the sample size is smaller in proportion to the population size. In theory, as fn
gets smaller, the MvHypergeometric
should converge to a Multinomial
with equal sample size (n_multinomial = n_mvh
) and p_multinomial = K_mvh / N_mvh
.
Finally, in all of the tests with different values of fn
and trueK
, even the obviously very poorly mixed posteriors seemed to cover the true values.
I’ve done some tests (in another post in the PyMC3 Discourse) comparing the logp()
of this implementation of the MvHypergeometric
with other implementations, with positive results. I also compared random()
's output with others in the same post, also with positive results.
I’ve now spent the better part of a week thinking about this and messing with the code, but I wasn’t able to get anywhere further. Do y’all have any insights to share?