Data representation for mixtures of multinomials, and Categorical vs Mixture?

breadbaron · March 9, 2019, 11:08pm

I’m trying to model my data as a mixture of two multinomials, where one component has been previously estimated, and I’m trying to infer the remaining component (comp_1 below). I’m wondering if my model specification is leading me to a very memory-inefficient representation. The model looks something like:

    w_1 = pm.Dirichlet('w_1', a=prior, shape=numBins)
    comp_1 = pm.Multinomial('c_1', p=w_1, n=1, shape=numBins).distribution
    comp_2 = pm.Multinomial('c_2', p=previously_estimated, n=1, shape=numBins).distribution
    weights = [0.1, 0.9]
    
    x = pm.Mixture('x' , comp_dists = [comp_1, comp_2], w = weights, observed = observations )            
    trace = pm.sample(3000, tune = 500)

On a high level, the observations can be summarized as a vector of length numBins with a count in each, like this for a 3-bin model:

[ 0, 1, 2 ]

As I understand, setting n=1 in the multinomial specification ensures that each observed count in my input data is free to come from either mixture component. But this means that the input data must be specified as a list of observation vectors, each of length numBins, and each having sum 1. For numBins = 3 the above observed data would look something like:

[ [0, 0, 1], [0, 0, 1], [0, 1, 0] ]

So my question is - are there technical and/or statistical limits I’m putting on this model by encoding it this way, with Multinomial n=1? In practice, it seems like the RAM required to load a set of observations into a model is orders of magnitude more than they occupy outside of pymc3.

Second, and possibly unrelated - is it better to encode the source component using pm.Categorical rather than specifying the mixture with pm.Mixture? Something like what’s presented in https://docs.pymc.io/notebooks/gaussian_mixture_model.html?

category = pm.Categorical('category',
                              p=p,
                              shape=ndata)

When is pm.Category preferred to using pm.Mixture?

breadbaron · March 9, 2019, 11:35pm

OK - in a quasi-related way, is it better in this case to encode the individual observations as bin-valued Categorical variables, rather than multinomials? It seems if I do this I can specify the observations from above as (zero-based):

[2, 2, 1]

Sorry that I’m asking about two different uses of the Categorical class - one to encode mixture components and the other for the Multinomial( n =1 ) case. Hopefully my question(s) make sense - let me know and I can clarify.

lucianopaz · March 10, 2019, 6:38pm

In my opinion yes, it is better to use categorical instead of multinomial. And yes, you use zero based ordering.

About memory consumption, multinomial indeed uses more memory because observations are (N, len(p)) instead of (N,). Furthermore, pymc3 uses up more memory because of two reasons:

If you run with more than a single core, multiprocessing copies the model around (with all its observations) into each process that runs the sampling.
When compiling the model’s logp there are some cloning involved which may produce copies in memory, and not just views of other memory addresses.

If you are memory constrained, use the categorical distribution instead of the multinomial and sample with less cores.

Regardless, I imagine it will be very hard to get good mixing of two discrete distributions. I strongly recommend that you stick with the mixture distribution and not model the latent class as a categorical

breadbaron · March 18, 2019, 6:14pm

Thanks! Very helpful

breadbaron · April 15, 2019, 11:14pm

@lucianopaz and others - related question: is it possible to specify the observed data as a collection of counts, rather than individual observations? E.g. going back to my initial example of a 3-bin model:

[0, 1, 2]
This is currently passed in to the categorical mixture model as
[2,2,1]

This is OK for smaller datasets, but as I am modeling DNA sequencing data with millions of molecules that fall into a relatively small number of bins, it becomes limiting unless pymc3 actually represents things as counts under the hood (which I don’t think it does in my current formulation).

Said another way, the effort required in computing the likelihood of a simple binomial model shouldn’t depend on the total observation count, since you’re just raising p and 1-p to some potentially large powers.

So - is this a matter of reformulating / re-specifying my model, or are there native ways to encode observed data in this way that avoids arrays having length of the size of my data?

lucianopaz · April 17, 2019, 6:28am

Pymc3 does not reformulate the model you provide it with under the hood. For example, if you have a string of iid Bernoulli observations, and you’re only interested in find p, pymc3 does not reformulate the model into a binomial, which is more memory friendly and also faster in this case.

Looking at your initial model, I think that you should be able to supply the total counts if you increase each multinomial’s n.

Topic		Replies	Views
Multinomial mixture with observed-values themselves as mixtures Questions	8	1033	March 29, 2019
How to create model with series of multinomials, each of which has observations on only a subset of components? v5 modeling	1	339	November 27, 2022
Posterior to Prior in Categorical distribution (or encapsulating multiple data sources to categorical analysis) Questions modeling	4	1771	February 28, 2022
Why my multinomial model with categorical predictors and response differs so much from results in R? version agnostic modeling	5	95	January 26, 2025
How to implement Zero-Inflated Multinomial? Questions	24	2844	November 10, 2019

Data representation for mixtures of multinomials, and Categorical vs Mixture?

Related topics