I’m trying to model my data as a mixture of two multinomials, where one component has already been estimated and I want to infer the remaining component (comp_1 below). I’m wondering whether my model specification is leading me to a very memory-inefficient representation. The model looks something like:
import pymc3 as pm

with pm.Model() as model:
    # prior over the unknown component's bin probabilities
    w_1 = pm.Dirichlet('w_1', a=prior, shape=numBins)
    # unnamed distribution objects to use as mixture components
    comp_1 = pm.Multinomial.dist(p=w_1, n=1, shape=numBins)
    comp_2 = pm.Multinomial.dist(p=previously_estimated, n=1, shape=numBins)
    weights = [0.1, 0.9]
    x = pm.Mixture('x', comp_dists=[comp_1, comp_2], w=weights, observed=observations)
    trace = pm.sample(3000, tune=500)
At a high level, the observations can be summarized as a vector of length numBins with a count in each bin, like this for a 3-bin model:
[ 0, 1, 2 ]
As I understand it, setting n=1 in the multinomial specification ensures that each observed count in my input data is free to come from either mixture component. But this means that the input data must be specified as a list of observation vectors, each of length numBins and each summing to 1. For numBins = 3 the above observed data would look something like:
[ [0, 0, 1], [0, 0, 1], [0, 1, 0] ]
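For concreteness, this is roughly how I expand the summarized counts into that one-hot observation matrix before handing it to pymc3 (the numpy expansion is just my own preprocessing, outside the model):

import numpy as np

counts = np.array([0, 1, 2])   # summarized counts per bin (numBins = 3)

# one observation row per unit of count; each row is a one-hot vector
# of length numBins that sums to 1
observations = np.repeat(np.eye(len(counts), dtype=int), counts, axis=0)
# -> [[0, 1, 0],
#     [0, 0, 1],
#     [0, 0, 1]]

So the expanded matrix has one row per unit of count, roughly (total count) x numBins entries instead of just numBins, which I assume is part of the memory cost.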
So my question is: are there technical and/or statistical limits I’m imposing on this model by encoding the data this way, with Multinomial n=1? In practice, the RAM required to load a set of observations into the model seems to be orders of magnitude more than the raw counts occupy outside of pymc3.
Second, and possibly unrelated - is it better to encode the source component of each observation explicitly using pm.Categorical rather than specifying the mixture with pm.Mixture? Something like what’s presented in https://docs.pymc.io/notebooks/gaussian_mixture_model.html:
category = pm.Categorical('category', p=p, shape=ndata)
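If I adapted that notebook to my multinomial case, I imagine the explicit-assignment version would look roughly like the sketch below (this is just my untested translation, using theano indexing to pick each observation’s component probabilities):

import numpy as np
import theano.tensor as tt

with pm.Model() as explicit_model:
    w_1 = pm.Dirichlet('w_1', a=prior, shape=numBins)
    # latent component assignment for each observation row
    category = pm.Categorical('category', p=np.array([0.1, 0.9]), shape=ndata)
    # stack both components' bin probabilities and select one row per observation
    ps = tt.stack([w_1, tt.as_tensor_variable(previously_estimated)])
    x = pm.Multinomial('x', n=1, p=ps[category],
                       shape=(ndata, numBins), observed=observations)

As far as I can tell, the difference is that category is sampled explicitly per observation here, whereas pm.Mixture marginalizes it out.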
When is pm.Categorical preferred to pm.Mixture?