Data representation for mixtures of multinomials, and Categorical vs Mixture?


#1

I’m trying to model my data as a mixture of two multinomials, where one component has been estimated previously and I want to infer the remaining component (comp_1 below). I’m wondering if my model specification is leading me to a very memory-inefficient representation. The model looks something like:

    import pymc3 as pm

    with pm.Model() as model:
        w_1 = pm.Dirichlet('w_1', a=prior, shape=numBins)
        # unnamed component distributions to pass to pm.Mixture
        comp_1 = pm.Multinomial.dist(n=1, p=w_1, shape=numBins)
        comp_2 = pm.Multinomial.dist(n=1, p=previously_estimated, shape=numBins)
        weights = [0.1, 0.9]

        x = pm.Mixture('x', w=weights, comp_dists=[comp_1, comp_2], observed=observations)
        trace = pm.sample(3000, tune=500)

At a high level, the observations can be summarized as a vector of length numBins with a count in each bin, like this for a 3-bin model:

[ 0, 1, 2 ]

As I understand it, setting n=1 in the Multinomial specification ensures that each observed count in my input data is free to come from either mixture component. But this means the input data must be specified as a list of one-hot observation vectors, each of length numBins and each summing to 1. For numBins = 3, the observed data above would look something like:

[ [0, 0, 1], [0, 0, 1], [0, 1, 0] ]
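
For concreteness, here is a small numpy helper (just a hypothetical sketch, not part of my model code) that expands per-bin counts into those one-hot rows; the row order shouldn’t matter to the model:

    import numpy as np

    def counts_to_onehot(counts):
        """Expand per-bin counts into one one-hot row per observation (Multinomial n=1)."""
        counts = np.asarray(counts)
        bin_indices = np.repeat(np.arange(len(counts)), counts)  # [0, 1, 2] -> [1, 2, 2]
        return np.eye(len(counts), dtype=int)[bin_indices]

    counts_to_onehot([0, 1, 2])
    # array([[0, 1, 0],
    #        [0, 0, 1],
    #        [0, 0, 1]])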

So my question is: are there technical and/or statistical limits I’m putting on this model by encoding it this way, with Multinomial n=1? In practice, the RAM required to load a set of observations into a model seems to be orders of magnitude more than the same observations occupy outside of pymc3.

Second, and possibly unrelated - is it better to encode the source component using pm.Categorical rather than specifying the mixture with pm.Mixture? Something like what’s presented in https://docs.pymc.io/notebooks/gaussian_mixture_model.html?

    category = pm.Categorical('category',
                              p=p,
                              shape=ndata)

When is pm.Categorical preferred to pm.Mixture?
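
To make the comparison concrete, here is roughly what that explicit latent-class version would look like for my model - only a sketch, assuming Multinomial accepts a separate probability row per observation, and reusing prior, numBins, ndata, previously_estimated, and observations from above:

    import numpy as np
    import theano.tensor as tt
    import pymc3 as pm

    with pm.Model() as model:
        w_1 = pm.Dirichlet('w_1', a=prior, shape=numBins)
        # explicit per-observation component assignment, as in the GMM notebook
        category = pm.Categorical('category', p=np.array([0.1, 0.9]), shape=ndata)
        # each observation uses the bin probabilities of its assigned component
        p_obs = tt.stack([w_1, previously_estimated])[category]
        obs = pm.Multinomial('obs', n=1, p=p_obs, shape=(ndata, numBins),
                             observed=observations)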


#2

OK - in a quasi-related way, is it better in this case to encode the individual observations as bin-valued Categorical variables rather than Multinomials? It seems that if I do this, I can specify the observations from above as (zero-based):

[2, 2, 1] 
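
That conversion is a one-liner in numpy (a sketch; the order of the entries shouldn’t matter):

    import numpy as np

    counts = np.array([0, 1, 2])
    observations = np.repeat(np.arange(len(counts)), counts)  # array([1, 2, 2])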

Sorry that I’m asking about two different uses of the Categorical class - one to encode the mixture component assignments and the other to replace the Multinomial(n=1) observations. Hopefully my question(s) make sense - let me know and I can clarify.


#3

In my opinion, yes, it is better to use Categorical instead of Multinomial. And yes, you would use zero-based indexing.

About memory consumption: the Multinomial encoding indeed uses more memory because the observations are stored as an (N, len(p)) array instead of (N,). Furthermore, pymc3 uses additional memory for two reasons:

  1. If you run with more than a single core, multiprocessing copies the model around (with all its observations) into each process that runs the sampling.
  2. When compiling the model’s logp, there is some cloning involved, which may produce copies in memory rather than just views of existing memory.

If you are memory constrained, use the Categorical distribution instead of the Multinomial and sample with fewer cores.

Regardless, I imagine it will be very hard to get good mixing if you sample the discrete latent class assignments explicitly. I strongly recommend that you stick with the Mixture distribution, which marginalizes the latent class out of the likelihood, rather than modeling it with a Categorical.
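
Concretely, something along these lines is what I have in mind - just a sketch, assuming pm.Mixture accepts Categorical components the same way it accepts other discrete comp_dists, and reusing the names from your post:

    import pymc3 as pm

    # observations given as zero-based bin indices, e.g. np.array([1, 2, 2])
    with pm.Model() as model:
        w_1 = pm.Dirichlet('w_1', a=prior, shape=numBins)
        comp_1 = pm.Categorical.dist(p=w_1)
        comp_2 = pm.Categorical.dist(p=previously_estimated)

        # pm.Mixture marginalizes the latent class assignment out of the likelihood
        x = pm.Mixture('x', w=[0.1, 0.9], comp_dists=[comp_1, comp_2],
                       observed=observations)
        trace = pm.sample(3000, tune=500, cores=1)  # fewer cores -> fewer model copies

Note also that with fixed weights, a mixture of Categoricals over the same bins is itself just a Categorical with p = 0.1 * w_1 + 0.9 * previously_estimated, so if pm.Mixture gives you any trouble you can write the marginalized model with a single observed pm.Categorical.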


#4

Thanks! Very helpful