How can we build a mixture of mixtures?


#1

Hi,

being able to build Mixture of any kind of distributions (not only Gaussian) is great, thanks to Pymc3 !

It’s so great that, now, I’d like to be able to build models by mixing mixtures!
(I’d like to build a mixture of Gaussian and a mixture of lognormal, and then a mixture of these two mixtures)

But it’s doesn’t work. I got error message telling that the mixtures I’m giving as component
have no “logp” method :

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/.conda/envs/pymc3.3/lib/python3.6/site-packages/pymc3/distributions/mixture.py in _comp_logp(self, value)
     81 
---> 82             return comp_dists.logp(value_)
     83         except AttributeError:

AttributeError: 'list' object has no attribute 'logp'

A work around would be to create one unique mixture, with a long list of Gaussian and log normal distributions… The problem is that I want a long list of Gaussian and lognormal distributions.

But the mixture class only accept a list of single distributions (possibily heterogeneous) or a unique multidimensional (homogeneous) distribution… Since I need to have both Gaussian and log normal, this means I’ll have to declare these distributions “by hand” , one line of code after another (that’s horrible), moreover this means I can’t change the number of components without adding lines of code to my model…

I don’t see theoretical limitation to build such model, but I face a lot of technical coding limitations.

This is so frustrating… Any help, or suggestion will be appreciated.

regards,
Hubert.


#2

I think the mixture class currently is not flexible enough to handle these kinds of situations. I would probably write a custom logp.

I dont think you need to add a new line for more components, as for Gaussian and LogNormal you can specify the shape - or maybe I am misunderstanding something?


#3

ok, thanks for the advice.

Mixtures accept either a list of one dimensional distribution, or a unique multidimensional distribution, but you can’t mix the two.
It’s written here : http://docs.pymc.io/api/distributions/mixture.html
The given exemple is quite explicit.

The solution I found is to build mixture components in a loop, and store them
in a list, passed to the mixture class.
It works, but I’ve read that it’s a bad thing to have explicit loops in model !!
(I’ve read that’s it’s far slower that using multidimensional distributions)

So it could be great to have a more flexible way to specify components to mixture.


#4

Yes ideally you should try to avoid using loops, as you generate a lot of new ops to the computational graph. Sometimes, what you can do is stacking the RVs together:

with pm.Model() as m:
    tostack = []
    for i in range(K):
        tostack.append(...)
    stacked_RV = tt.stacklists(tostack)

Doing so, theano might be able to perform some graph optimization to make the computation efficient.

I agree - we are working towards that currently. The main roadblock is the RV/distribution shape and its smart broadcasting issue, which creates headaches from time to time. There is a WIP to refactor it to make it more flexible and precise (https://github.com/pymc-devs/pymc3/pull/2833). Hopeful that goes well and we can make the mixture class more flexible.


#5

Thanks for this hint… But I certainly misunderstood something since It makes an error.

I have two vectors of dimension 10 and 5, I’d like to stack them to obtain a 15 dimensions vector…
This is my code :

w = [ ]
w.append(pm.Dirichlet('w_1',a=np.array([0.0000001]*10)))
w.append(pm.Dirichlet('w_2',a=np.array([1]*5)))
w_ = tt.stacklists(w)

The error occurs on the stacklists line: 

"ValueError: all the input array dimensions except for the concatenation axis must match exactly"

So, just for a try (because it’s not what I’m looking for), I set the same dimension (10) for all the vectors :

w = [ ]
w.append(pm.Dirichlet(‘w_1’,a=np.array([0.0000001]*10)))
w.append(pm.Dirichlet(‘w_2’,a=np.array([1]*10)))
w_ = tt.stacklists(w)

Then The stack operation seems to work, at least it do not produce error… But it fails later in my code because the dimension of w_ is now 10 (and should be 15 for my application).
But this surprises me because I though that the dimension of w_ would be 20 then (10+10),
so I think I really don’t understand how to stack/group random variables…

Any help will be appreciated.


#6

I think what you are looking for is:

w2 = tt.concatenate(w, axis=0)

You could find more information in the theano doc: http://deeplearning.net/software/theano/library/tensor/basic.html


#7

This is weird…

The concatenate function works on the weight vector, more exactly it works to pack a Dirichlet distribution with a bunch of constant values (that I had to stick together with the “stack” method, by the way).
But the “concatenate” method output this error when used on the distribution of the mixture :

AsTensorError: ('Cannot convert <pymc3.distributions.continuous.Lognormal object at 0x7fbcc12c2438> to TensorType', <class 'pymc3.distributions.continuous.Lognormal'>)

I found it very hard to understand : all I want to do here is to pack one dimensional distributions (and constants) to build a heterogeneous mixture, but sometimes I need to use the “stack” function, sometimes the “concatenate”, and sometimes neither works…

I really don’t understand why what works for the weight vector distribution doesn’t work for the mixture components… Both are distributions!!

So, for the moment, my model can be executed, but is awfully slow and consume all the memory of my computer. I can only run it on a small dataset and reduced model dimension.

I guess , it is due to my last un-optimized loop. I really need to translate the python list of distributions mixture elements into a theano structure.

I tryied stack, stacklists, concatenate, but none works to stick the mixture elements together (which are log-normal, normal, and one halfnormal.)

Help of any kind will be appreciated.


#8

I still quite strongly suggest you to write down the mixture logp directly. That way you can directly verify the result, avoid the loops and stacking etc. Since the density is one function as other densities, it would be faster as well. As I mentioned earlier:

You can also post your code with (simulation) data here.


#9

Ok, I try your strategy : building my own logp mixture…

This is my plan :

  • I build a dataset that is a 50:50 % mixture of two lognormal.
  • I build a model composed of 5 lognormal with (exponential) priors on mu and sd.
  • Then I create an observable DensityDist, with a call to a custom lopg function…

The problem arise when trying to pass the mixture distribution components as argument to the logp function : I just can’t find a way that works!

This is my code, and the error I get :

import sys
import pymc3 as pm, theano.tensor as tt
import numpy as np

import matplotlib.pyplot as plt

# simulated data :  a mix of 2 lognormal
data = np.concatenate( (
    np.random.lognormal(mean=3,sigma=0.5,size=1000),
    np.random.lognormal(mean=5,sigma=0.5,size=1000)) )

nbr = 5

def logp(components,w, data): # the custom logp function     #
    lp = 0
    for d in data: # for all data point
        for i in range(nbr):  # for all components of the mixture
            c = components[i].logp(d) # logp value of data again each components
            lp += (w[i]*c) # integrate weighting
    return lp #tt.ones(1)

with pm.Model() as model:
    components = []
    for i in range(nbr): # build the components list
        components.append(pm.Lognormal.dist(mu=pm.Exponential('mu_'+str(i), lam=1) 
        , sd=pm.Exponential('sd_'+str(i), lam=1)))
    w = pm.Dirichlet('w',a=np.array([0.0000001]*nbr)) # weight vector for the mixture

    # home-made mixture with a custom logp function
    mix = pm.DensityDist('mix',logp,observed={'w':w,'data':data,'components':components})

    # MCMC
    trace = pm.sample(1000, tune=1000)

    # graphes
    pm.traceplot(trace)
    plt.show()

And this is the error I get :
“TypeError: float() argument must be a string or a number, not ‘Lognormal’”

Which correspond to this line of my code :

mix = pm.DensityDist('mix',logp,observed={'w':w,'data':data,'components':components})

The logp function is certainly not correct too, but I can’t even get to the point to execute it!!

I tried to “tt.stack” or “tt.concatenate” on the “components” list, before the call to “DensityDist”, but these functions only work with random variable (and not “.dist” object).
If I declare the components as “pm.Lognormal” (not pm.Lognormal.dist") then I can tt.stack or tt.concatenate them and pass it to the custom logp function, but then I can’t access the lopg function of the “components”!
(and, anyway, I guess that Pymc3 will automatically add a logp cost from each components wich will lead to a erroneous global logp function)

I’m so clueless … ;(

I need to pass the mixture components to the custom logp function to be able to build the mixture logp, but it seems that DensityDist prevents me to do so.


#10

Oh OK, I think I might have lead you to a wrong path :sweat_smile:
In your case, you can use pm.Mixture out of the box:

nbr = 5
with pm.Model() as model:
    w = pm.Dirichlet('w', a=np.array([0.0000001]*nbr)) # weight vector for the mixture
    mu = pm.Exponential('mu', lam=1, shape=nbr)
    sd = pm.Exponential('sd', lam=1, shape=nbr)
    components = pm.Lognormal.dist(mu=mu, sd=sd, shape=nbr) 
    obs = pm.Mixture('y', w, components, observed=data)

#11

Ok, I knew that I could use the mixture in that case. The previous code was a toy problem showing the encountered problem while trying to build a mixture myself…

Meanwhile I found an other work around for my problem :wink:

And then, I face another strange behavior :cry: : mixtures behaves differently if they have been built from a unique multi-dimensional distribution, or from a list of mono-dimensional distributions.

This is the method using a list of mono-dimensional distributions (let’s call this the ‘A’ version of the code) :

with pm.Model() as model:
    ln_nbr = 5
    ln_components = []
    for i in range(ln_nbr): # explicit loop (bad!?)
        mu = pm.Exponential('mu_'+str(i), lam=1) 
        sd = pm.HalfNormal('sd_'+str(i), sd=1) 
        ln_components.append(pm.Lognormal.dist(mu=mu,sd=sd)) #,shape=ln_nbr)
    ln_w = pm.Dirichlet('ln_w',a=np.array([0.0000001]*ln_nbr))
    ln_mix = pm.Mixture.dist(w=ln_w,comp_dists=ln_components)

In that case I can call : ln_mix.logp(x) (not show in the code example)
and it runs ok.

But If I do the same mixture built from a multidimensional random variable (instead of a list of mono-dimensional ones), then I can’t run the lopg method !! Why ??

The following code (‘B’ version) should (theoretically) be equivalent to the one above (A) : (but it’s not!)

with pm.Model() as model:
    # "continuous" part : lognormal mixture
    ln_nbr = 5
    mu = pm.Exponential('mu', lam=1, shape=ln_nbr)
    sd = pm.HalfNormal('sd', sd=1, shape=ln_nbr)
    ln_components = pm.Lognormal.dist(mu=mu,sd=sd,shape=ln_nbr)
    ln_w = pm.Dirichlet('ln_w',a=np.array([0.0000001]*ln_nbr))
    ln_mix = pm.Mixture.dist(w=ln_w,comp_dists=ln_components)
    print(" end lognorm part...")

When trying to access the “ln_mix.logp(x)” method, I get this error message :

TypeError: 'Lognormal' object is not iterable

The error is linked to the ln_mix.logp() call.

This is sad since I guess that this later model formulation (B) is the most efficient one (without explicit loop). Am I guessing right ?

If the B version is the most efficient, how can I still access the logp method when building my model that way ?

Thanks for your help (even for the "mis leading " one :wink: , it helped me better understand my problem and the library philosophy.)


#12

It should works also for B version with the shape specified for the Mixture likelihood:
ln_mix = pm.Mixture.dist(w=ln_w,comp_dists=ln_components, shape=x.shape)


#13

Then it’s : ln_mix = pm.Mixture.dist(w=ln_w,comp_dists=ln_components, shape=1)
for my case…(since it’s a one dimensional distribution that I’m trying to fit)

But I still get the same error message… ;(


#14

That’s weird - it works for me:

with pm.Model() as model:
    # "continuous" part : lognormal mixture
    ln_nbr = 5
    mu = pm.Exponential('mu', lam=1, shape=ln_nbr)
    sd = pm.HalfNormal('sd', sd=1, shape=ln_nbr)
    ln_components = pm.Lognormal.dist(mu=mu,sd=sd,shape=ln_nbr)
    ln_w = pm.Dirichlet('ln_w',a=np.array([0.0000001]*ln_nbr))
    ln_mix = pm.Mixture.dist(w=ln_w,comp_dists=ln_components, shape=1)

ln_mix.logp(np.asarray([10.]))

#15

Thanks so much !!! you solved a big part of my problem !!

But now, all this is even stranger for me :wink:

The shape=1 parameter has no effect for me (anyway if not provided, I guess that 1 is the default value(?)). So I guess this has neither effect for you…

The real effect comes from the np.asarray([ ]) inside the lopg call ! It’s not needed for the ‘A’ version (with a list of one-dimensional distribution components), but it is mandatory on the ‘B’ version (with a single multidimensional distribution list as components).

How did you find this formulation/solution ?
It so weird because , for me, A & B version only differs in the way they provide mixture distribution components, not it’s input, so why do we have to shape the input differently ?

Now, I’d like to be able to call the logp function with an array, not a single point (to avoid loop! again…)…
How can I do this ? (all my attemps failed :cry: )


#16

It’s mostly trial and error until I get how it is… these are edge cases that we are trying to improve.
as for calling logp with an array, try below:
ln_mix.logp(np.random.lognormal(size=100))


#17

trying this :

mu_g = np.arange(5,data_max,5)
lp = np.exp(ln_mix.logp(np.asarray([mu_g])))

gives this error message :

ValueError: Input dimension mis-match. (input[0].shape[1] = 40, input[1].shape[1] = 5)

where 40 is the size of mu_g, and 5 is the number of components in the ln_mix mixture.

So I guess the following : when building mixture from multidimensional distribution, we have to provide, and copy input values, for each dimension …?

So I tried to format my input data with this idea :

So starting with this input data :
array([ 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65,
70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130,
135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200])

I structured it this way :

array([[  5,   5,   5,   5,   5],
       [ 10,  10,  10,  10,  10],
       [ 15,  15,  15,  15,  15],
       [ 20,  20,  20,  20,  20],
       [ 25,  25,  25,  25,  25],
       [ 30,  30,  30,  30,  30],
       [ 35,  35,  35,  35,  35],
       [ 40,  40,  40,  40,  40],
       [ 45,  45,  45,  45,  45],
       [ 50,  50,  50,  50,  50],
       [ 55,  55,  55,  55,  55],
...

In that case the logp function call is accepted ! :wink:

But, the result of the logp seem in the wrong format…This is the error message get when trying to put the result as a weight vector for a new mixture.

ValueError: Not enough dimensions on g_w to reduce on axis -1

When trying to access one lopg value with the [ ] operator, I get this error message :

ValueError: The index list is longer (size 1) than the number of dimensions of the tensor(namely 0). You are asking for a dimension of the tensor that does not exist! You might need to use dimshuffle to add extra dimension to your tensor.

From this error messages, I guess that the logp output is a unique value, where I expect it should be a vector.

So I’m still blocked to get rid of that last loop in my model…


#18

I dont think the mixture logp you are constructing is multidimensional, as in the code above where we have ln_mix is a mixture of 1D lognormal. And by default it will accept a vector as input to evaluate the logp - that’s why you will get an error when you try to pass an 2D array as in lp = np.exp(ln_mix.logp(np.asarray([mu_g]))).

I think there might be some confusion here: the mixture logp with multiple components (i.e., 5 logNormal in this case) is not multiple dimensional, as each component just got “mash” into 1. If you want a multidimensional mixture, then each component has to be multidimensional to start with.


#19

I agree that my mixture should be one dimensional (and that’s what I want).

But we have to admit that this experiment :

mu_g = np.arange(5,data_max,5) # mu_g size is 40 
lp = np.exp(ln_mix.logp(np.asarray([mu_g]))) # a 1 dimensional mixture build on 5 components

=> ValueError: Input dimension mis-match. (input[0].shape[1] = 40, input[1].shape[1] = 5)

tells that the logp function is expecting a 5 dimension vector… no ?
And providing a 5 dimension vector suppress this error, but is it really computing the correct lopg value of the mixture? I don’t know, since this input format is also accepted :

array([[  5],
       [ 10],
       [ 15],
       [ 20],
       [ 25],
       [ 30],
       [ 35],
       ...

Which is not 5-dim, so I don’t know why it says that it expected a 5 dimension input vector !

And anyway I always have an error when trying to use the logp output as 1-D vector
(either in as the weights of a new mixture, or simply trying to access a given element of that output vector).

But, is the lopg function supposed to be used with a vector as an input is the first place ??
(expecting to produce a vector output).

I seems to be true for basic distributions :

ln = pm.Lognormal.dist(mu=1,sd=0.5)
ln.logp([12,13]).eval()
=> array([-7.12059352, -7.68887369])

But not with mixtures (!):

 y = pm.Mixture.dist(w=[0.5,0.5],comp_dists=pm.Normal.dist(mu=np.asarray([1,2]),sd=np.asarray([1,1]),shape=2))
y.logp(np.asarray([12])).eval()
=> array(-51.612058177694394)
y.logp(np.asarray([13])).eval()
=> array(-62.11207558372233)
 y.logp(np.asarray([12,13])).eval()
=>array(-113.72413376141672)

The logp function seems to apply a sum operator, which sounds weird since a logp_sum function also exists :
y.logp_sum(np.asarray([12,13])).eval()
=> array(-113.72413376141672)

This explains the lack of dimension of the lopg output!

But I still don’t know how to call the logp function on an array without an explicit Python loop that totally ruins the performance.


#20

Not really, during the computation, it can potentially accept 5d vector but that’s not what it is intended. The correct input format should be a 1D array eg lp = np.exp(ln_mix.logp(np.asarray([mu_g]).flatten())).

I think this is fixed on master:

mu_g = np.arange(5,50,5)
ln_mix.logp(np.asarray([mu_g]).flatten()).tag.test_value
# outputs 
array([[ -2.96199652],
       [ -5.03014159],
       [ -6.58978369],
       [ -7.85298049],
       [ -8.92231757],
       [ -9.85409017],
       [-10.68262926],
       [-11.43051321],
       [-12.11344185]])