I think the most likely cause of the low ESS is the Metropolis step. Metropolis (or Gibbs) steps are the only way to sample discrete variables such as K, but they mix poorly, and mixing gets even worse as the number of dimensions (in your case, the number of classes) increases. As a result, the trace for K will show long autocorrelation: consecutive draws are strongly correlated, so each new sample adds little independent information and the effective sample size stays low. A typical way to deal with this is to thin the chain after sampling, so that each element of the thinned chain has little to no autocorrelation with the previous one. The downside is that you have to draw many more samples than the ones you end up keeping. In your case thinning by a factor of 10 might work, but you should have a look at the autocorrelation plots first.
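Something like the following minimal sketch could help you check the autocorrelation and thin the trace with ArviZ (assuming your sampler returns an InferenceData called idata and the discrete variable is named "K"; adjust the names and the thinning factor to your case):

```python
import arviz as az

# Inspect how quickly the autocorrelation of K decays
az.plot_autocorr(idata, var_names=["K"])

# Effective sample size before thinning
print(az.ess(idata, var_names=["K"]))

# Keep every 10th draw; pick the step based on the autocorrelation plot
idata_thinned = idata.sel(draw=slice(None, None, 10))
print(az.ess(idata_thinned, var_names=["K"]))
```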
The only alternative I can think of, to avoid Metropolis, is to marginalize out K. This means writing down the probability distribution of your MvHypergeometric conditional on K and then summing over all possible values of K, weighted by its prior. This leads to a Mixture of MvHypergeometrics with fixed K's, so you would no longer be inferring discrete values; you would be left with only the continuous p and could sample using NUTS. However, mixtures are also quite hard to sample from, so I would go with chain thinning first and see how it goes.