Data representation for mixtures of multinomials, and Categorical vs Mixture?

#1

I’m trying to model my data as a mixture of two multinomials, where one component has been previously estimated, and I’m trying to infer the remaining component (comp_1 below). I’m wondering if my model specification is leading me to a very memory-inefficient representation. The model looks something like:

    w_1 = pm.Dirichlet('w_1', a=prior, shape=numBins)
    comp_1 = pm.Multinomial('c_1', p=w_1, n=1, shape=numBins).distribution
    comp_2 = pm.Multinomial('c_2', p=previously_estimated, n=1, shape=numBins).distribution
    weights = [0.1, 0.9]
    
    x = pm.Mixture('x' , comp_dists = [comp_1, comp_2], w = weights, observed = observations )            
    trace = pm.sample(3000, tune = 500)

On a high level, the observations can be summarized as a vector of length numBins with a count in each, like this for a 3-bin model:

[ 0, 1, 2 ]

As I understand, setting n=1 in the multinomial specification ensures that each observed count in my input data is free to come from either mixture component. But this means that the input data must be specified as a list of observation vectors, each of length numBins, and each having sum 1. For numBins = 3 the above observed data would look something like:

[ [0, 0, 1], [0, 0, 1], [0, 1, 0] ]

So my question is - are there technical and/or statistical limits I’m putting on this model by encoding it this way, with Multinomial n=1? In practice, it seems like the RAM required to load a set of observations into a model is orders of magnitude more than they occupy outside of pymc3.

Second, and possibly unrelated - is it better to encode the source component using pm.Categorical rather than specifying the mixture with pm.Mixture? Something like what’s presented in https://docs.pymc.io/notebooks/gaussian_mixture_model.html?

category = pm.Categorical('category',
                              p=p,
                              shape=ndata)

When is pm.Category preferred to using pm.Mixture?

#2

OK - in a quasi-related way, is it better in this case to encode the individual observations as bin-valued Categorical variables, rather than multinomials? It seems if I do this I can specify the observations from above as (zero-based):

[2, 2, 1] 

Sorry that I’m asking about two different uses of the Categorical class - one to encode mixture components and the other for the Multinomial( n =1 ) case. Hopefully my question(s) make sense - let me know and I can clarify.

#3

In my opinion yes, it is better to use categorical instead of multinomial. And yes, you use zero based ordering.

About memory consumption, multinomial indeed uses more memory because observations are (N, len(p)) instead of (N,). Furthermore, pymc3 uses up more memory because of two reasons:

  1. If you run with more than a single core, multiprocessing copies the model around (with all its observations) into each process that runs the sampling.
  2. When compiling the model’s logp there are some cloning involved which may produce copies in memory, and not just views of other memory addresses.

If you are memory constrained, use the categorical distribution instead of the multinomial and sample with less cores.

Regardless, I imagine it will be very hard to get good mixing of two discrete distributions. I strongly recommend that you stick with the mixture distribution and not model the latent class as a categorical

1 Like
#4

Thanks! Very helpful

#5

@lucianopaz and others - related question: is it possible to specify the observed data as a collection of counts, rather than individual observations? E.g. going back to my initial example of a 3-bin model:

[0, 1, 2]
This is currently passed in to the categorical mixture model as
[2,2,1]

This is OK for smaller datasets, but as I am modeling DNA sequencing data with millions of molecules that fall into a relatively small number of bins, it becomes limiting unless pymc3 actually represents things as counts under the hood (which I don’t think it does in my current formulation).

Said another way, the effort required in computing the likelihood of a simple binomial model shouldn’t depend on the total observation count, since you’re just raising p and 1-p to some potentially large powers.

So - is this a matter of reformulating / re-specifying my model, or are there native ways to encode observed data in this way that avoids arrays having length of the size of my data?

#6

Pymc3 does not reformulate the model you provide it with under the hood. For example, if you have a string of iid Bernoulli observations, and you’re only interested in find p, pymc3 does not reformulate the model into a binomial, which is more memory friendly and also faster in this case.

Looking at your initial model, I think that you should be able to supply the total counts if you increase each multinomial’s n.