Properly sampling mixture models

Hello!

As per my last post, I was warned that sampling mixture models is dangerous, so I wanted some advice on possible tweaks to the model below. It is an unsupervised clustering model.

import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm

with pm.Model() as Model_:
    #Create a covariance matrix for each potential cluster which relates all features of our data
    Lower = tt.stack([pm.LKJCholeskyCov('Sigma_{}'.format(k), n=NumberOfFeatures_, eta=2., 
                                        sd_dist=pm.HalfCauchy.dist(2.5)) for k in range(NumberOfClusters_)])
    Chol = tt.stack([pm.expand_packed_triangular(NumberOfFeatures_, Lower[k]) for k in range(NumberOfClusters_)])

    #The center of each cluster
    Mus = tt.stack([pm.Normal('Mu_{}'.format(k), 0.1, 10., shape=NumberOfFeatures_) for k in range(NumberOfClusters_)])
    #W_Min_Potential = pm.Potential('L_potential', tt.switch(tt.min(Weights) < .1, -np.inf, 0))

    #Create the multivariate normal distribution for each cluster
    MultivariateNormals = [pm.MvNormal.dist(Mus[k], chol=Chol[k], shape=X.shape) for k in range(NumberOfClusters_)]

    #Create the weights for each cluster which measures how much impact they have 
    Weights = pm.Dirichlet('w', np.ones(NumberOfClusters_)/NumberOfClusters_)
      
    #Due to software bugs, we create a log likelihood by hand
    logpcomp = tt.stack([Dist.logp(theano.shared(X_)) for Dist in MultivariateNormals], axis=1)
    Prob = pm.Deterministic("Probabilities", logpcomp)
    pm.Potential('logp', pm.logsumexp(tt.log(Weights) + logpcomp, axis=-1).sum())

I’m finding that the effective sample size is quite low and the Gelman-Rubin statistic is high. Any help would be greatly appreciated, thank you!

Ha, funny, I am also working on a case study with a similar setup.

Rhat is usually high because of label switching. In 1D it is relatively easy to constrain the mus to be ordered; however, in 2D+ this is impossible.
For now, my suggestion is to run single chains, check convergence for each of them, and then assign the labels by hand. In my experience NUTS will usually explore only one mode (i.e., usually no label switching within a single chain), but across chains the labels could be different. Combined with some relabelling technique, this should give you something sensible.
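To make that concrete, here is a rough sketch of that workflow using the names from your model above. The relabelling helper is hypothetical, sorting components by the posterior mean of the first feature of Mu is just one possible rule, and it assumes your PyMC3 version accepts chains=1 in pm.sample:

import numpy as np

# Run each chain separately so NUTS stays in a single mode per run
traces = []
with Model_:
    for _ in range(4):
        traces.append(pm.sample(1000, chains=1))

# Hypothetical relabelling helper: sort the components of each chain by the
# posterior mean of the first feature of Mu, so "cluster 0" refers to the
# same mode in every chain
def relabel_order(trace, n_clusters):
    means = np.array([trace['Mu_{}'.format(k)].mean(axis=0) for k in range(n_clusters)])
    return np.argsort(means[:, 0])

orders = [relabel_order(t, NumberOfClusters_) for t in traces]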

But for a proper treatment you need to reparameterize it. To the best of my knowledge, Gaussian mixture model parameterization is best explained in https://arxiv.org/pdf/1601.01178.pdf, which I am currently working on :wink: You should check out the blog post by Christian Robert on this: https://xianblog.wordpress.com/2016/01/11/mixtures-are-slices-of-an-orange/. His other blog posts on mixtures are also great: https://xianblog.wordpress.com/?s=mixture
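For the 1D case, the ordering constraint can be written directly with the ordered transform. A minimal sketch, not your multivariate model; K and data_1d are placeholders, and it assumes a PyMC3 version that has pm.distributions.transforms.ordered:

import numpy as np
import pymc3 as pm
import pymc3.distributions.transforms as tr

K = 3  # placeholder number of components
with pm.Model() as model_1d:
    w = pm.Dirichlet('w', np.ones(K))
    # ordered means break the label-switching symmetry in 1D
    mu = pm.Normal('mu', 0., 10., shape=K,
                   transform=tr.ordered, testval=np.linspace(-1., 1., K))
    sigma = pm.HalfNormal('sigma', 1., shape=K)
    obs = pm.NormalMixture('obs', w=w, mu=mu, sd=sigma, observed=data_1d)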

FYI, you can now use pm.Mixture with multivariate normal components as well.

Depending on your goal, you might also try to (approximately) fit with variational inference. It will be fast, and should choose a single mode, avoiding label switching. The downside is that you’ll be approximating, rather than getting the exact guarantees from MCMC.

You can use it by replacing trace = pm.sample(N) with

inference = pm.fit(method='advi')  # this is mean field ADVI.  something like 'fullrank_advi' might be tried too
trace = inference.sample(N)
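If you go that route, it is worth checking that the optimization actually converged; a small sketch, using the loss history stored on the fitted approximation (inference.hist):

import matplotlib.pyplot as plt

plt.plot(inference.hist)   # negative ELBO per iteration; should flatten out
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()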

I think our current automatic VI does not work well for 2D Gaussian mixtures (maybe it is a problem with the optimizer; traditionally these are fitted with conjugate gradient descent). I tried full-rank and mean-field but neither works well.

Thanks, I will definitely use one chain and will also use pm.Mixture! Some other questions I have are:

  1. I used more than 4 clusters at the beginning with ADVI and found that 4 had appreciable weights while the others hovered around the 0.1 cutoff. Is that sufficient reason to use only 4, or is there a more rigorous test?

  2. The sd_dist priors available for the LKJ cov matrix are positive-only. Is there a more general prior I could use? I’ve been finding that the full covariance is diagonal after sampling. The data are height, lidar reflected intensity, and the Lab colors of the lidar data, so the colors might be independent, but I feel the intensity should be mildly correlated with the color. If I calculate the correlations from the pandas DataFrame, would that indicate the value the covariance matrix should take?

Again, thank you for your advice. Love this software!

I think the best model is a Dirichlet process mixture, but it is really difficult to do inference on. A good benchmark/baseline, I find, is sklearn.mixture.BayesianGaussianMixture, which infers the number of clusters for you.
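A quick sketch of that baseline, assuming your data is in an array X of shape (n_samples, n_features); components whose fitted weights collapse toward zero are effectively switched off:

from sklearn.mixture import BayesianGaussianMixture

bgm = BayesianGaussianMixture(
    n_components=10,                                    # deliberately generous upper bound
    weight_concentration_prior_type='dirichlet_process',
    max_iter=1000,
).fit(X)
print(bgm.weights_)    # weights near zero indicate unused components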

Try HalfNormal. The Cauchy distribution is known to be difficult to deal with because of its heavy tail.

It will certainly give you an idea of what range of prior to choose for the LKJ cov.
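For that check, something like the following (assuming df is the pandas DataFrame holding your features) gives a rough empirical target for the correlation structure:

corr = df.corr()        # empirical correlation matrix of the features
cov = df.cov()          # and the corresponding covariance matrix
print(corr.round(2))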

So I am attempting to use the mixture class via:

with pm.Model() as Model:
    
    y = theano.shared(X_)
    NumberOfData = g.shape[1]
    #Create a covariance matrix for each potential cluster which relates all features of our data
    Lower = tt.stack([pm.LKJCholeskyCov('Sigma_{}'.format(k), n=NumberOfFeatures, eta=2., 
                                        sd_dist=pm.HalfNormal.dist(sd = 1.)) for k in range(NumberOfClusters)])
    Chol = tt.stack([pm.expand_packed_triangular(NumberOfFeatures, Lower[k]) for k in range(NumberOfClusters)])

    #The center of each cluster
    Mus = tt.stack([pm.Uniform('Mu_{}'.format(k), lower = 0., upper = 1., shape=NumberOfFeatures) for k in range(NumberOfClusters)])

    #Create the multivariate normal distribution for each cluster
    MultivariateNormals = [pm.MvNormal.dist(Mus[k], chol=Chol[k], shape = NumberOfFeatures) for k in range(NumberOfClusters)]

    #Create the weights for each cluster which measures how much impact they have 
    Weights = pm.Dirichlet('w', np.ones(NumberOfClusters)/NumberOfClusters)
      
    #Due to software bugs, we create a log likelihood by hand
    #logpcomp = tt.stack([Dist.logp(theano.shared(y)) for Dist in MultivariateNormals], axis=1)
    #Prob = pm.Deterministic("Probabilities", logpcomp)
    
    Prob = pm.Mixture("prob", w=Weights, comp_dists=MultivariateNormals, shape = NumberOfDataPoints, observed = y)

but I am receiving

/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

ValueError: Input dimension mis-match. (input[0].shape[1] = 5, input[1].shape[1] = 1000)

where 5 is the number of clusters and 1000 is the number of data points. Thank you so much for your help!

Seems to run fine for me:

NumberOfFeatures = 5
NumberOfClusters = 3
y = np.random.randn(1000, NumberOfFeatures)

with pm.Model() as Model:
    # Create a covariance matrix for each potential cluster which relates all features of our data
    Lower = tt.stack([pm.LKJCholeskyCov('Sigma_{}'.format(k), n=NumberOfFeatures, eta=2.,
                                        sd_dist=pm.HalfNormal.dist(sd=1.)) for k in range(NumberOfClusters)])
    Chol = tt.stack([pm.expand_packed_triangular(
        NumberOfFeatures, Lower[k]) for k in range(NumberOfClusters)])

    # The center of each cluster
    Mus = tt.stack([pm.Uniform('Mu_{}'.format(k), lower=0., upper=1.,
                               shape=NumberOfFeatures) for k in range(NumberOfClusters)])

    # Create the multivariate normal distribution for each cluster
    MultivariateNormals = [pm.MvNormal.dist(
        Mus[k], chol=Chol[k], shape=NumberOfFeatures) for k in range(NumberOfClusters)]

    # Create the weights for each cluster which measures how much impact they have
    Weights = pm.Dirichlet('w', np.ones(NumberOfClusters) / NumberOfClusters)

    Prob = pm.Mixture("prob", w=Weights, comp_dists=MultivariateNormals,
                      observed=y)

Thank you for your check. What shape was your sample data? I’m thinking that perhaps that’s where I am going wrong.

1000 * 5