Divergences and label-switching in a Dirichlet process mixture model

TL;DR: I’m attempting to perform non-parametric clustering of circular data using a Dirichlet process mixture of von Mises distributions. I’m trying to fix label-switching using transforms.ordered, but my chains contain only divergent samples.

Details:
The observed data are stimulus values (colour) from a cognitive psychology experiment. Each trial of the experiment used 6 stimuli sampled from a uniform distribution. For our analysis we would like to characterize possible clusters of stimuli that the observer might perceive. Here’s an example stimulus set that could reasonably be clustered into 2-3 von Mises components: (apparently, I can only attach one image… data in a reply below).

Here is the model that I’ve implemented based on examples from the community:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pymc3 as pm
from theano import tensor as tt
import pymc3.distributions.transforms as tr

def stick_breaking(beta):
    portion_remaining = tt.concatenate([[1], 
                        tt.extra_ops.cumprod(1 - beta)[:-1]])

    return beta * portion_remaining
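(For anyone unfamiliar with the construction, here's a quick numpy analogue of `stick_breaking` showing what it does to a vector of Beta draws — `stick_breaking_np` is just an illustration, not part of the model:)

```python
import numpy as np

def stick_breaking_np(beta):
    # numpy analogue of the theano stick_breaking above:
    # each weight is beta[i] times the stick length remaining after i breaks
    portion_remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    return beta * portion_remaining

beta = np.array([0.5, 0.5, 0.5, 0.5])
w = stick_breaking_np(beta)
# w -> [0.5, 0.25, 0.125, 0.0625]; the leftover mass belongs to the truncated tail
```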

k = 10 # truncate the Dirichlet process at 10 components

with pm.Model() as m:
    # mixture weights
    alpha = 0.1
    beta = pm.Beta('beta', 1, alpha, shape=k)
    pi = pm.Deterministic('pi', stick_breaking(beta))
    
    # mixture parameters
    kappa = pm.Gamma('kappa', 5, 1, shape=k) 
    mu = pm.Uniform('mu', -np.pi, np.pi, shape=k)
    
    # initialize VM components
    all_vms = []
    for vm in range(k):
        all_vms.append(pm.VonMises.dist(mu=mu[vm], 
                                        kappa=kappa[vm]))
    
    obs = pm.Mixture('obs', w=pi, comp_dists=all_vms, 
                     observed=sample)

(The prior on kappa is intended to bias components toward narrow clusters and is based on previous experiments. I've found that the choice of prior here doesn't noticeably affect the issue below.)
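For context, here's a quick scipy check of what Gamma(5, 1) implies for kappa (just a sanity check, not part of the model):

```python
from scipy import stats

# kappa ~ Gamma(alpha=5, beta=1): prior mean 5, mode 4, so components are
# biased toward fairly concentrated (narrow) von Mises distributions
kappa_prior = stats.gamma(a=5, scale=1)
print(kappa_prior.mean())               # 5.0
print(kappa_prior.ppf([0.025, 0.975]))  # central 95% prior interval for kappa
```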

with m:
    trace = pm.sample(1000, tune=500, chains=4, 
                      init='adapt_diag',
                      nuts_kwargs=dict(target_accept=.85))
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, kappa, beta]
Sampling 4 chains: 100%|██████████| 6000/6000 [00:18<00:00, 317.08draws/s]
There were 755 divergences after tuning. Increase `target_accept` or reparameterize.
There were 784 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7670051481506523, but should be close to 0.85. Try to increase the number of tuning steps.
There were 718 divergences after tuning. Increase `target_accept` or reparameterize.
There were 790 divergences after tuning. Increase `target_accept` or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

That’s a lot of divergences! The Gelman-Rubin statistic also looks problematic.


Surprisingly, the samples don’t look too bad. The main glaring issue that I can see is the swapping between values for mu__0 and mu__1.

pm.summary(trace, ['mu'])
	    mean        sd	        mc_error	hpd_2.5	    hpd_97.5	n_eff	    Rhat
mu__0	-2.152960	0.687383	0.064667	-3.138628	-0.805135	9.510335	1.241728
mu__1	-1.447989	0.734422	0.067128	-2.975999	0.182541	28.046633	1.100530
mu__2	-0.936361	0.672227	0.059381	-2.215930	0.748179	36.990873	1.141194
mu__3	-0.658837	0.644452	0.055988	-1.845731	0.836822	42.063842	1.105958
mu__4	0.094493	0.761047	0.064515	-1.225197	1.578090	10.037738	1.188514
mu__5	0.493356	0.734129	0.061158	-0.811234	1.986704	17.941033	1.136716
mu__6	0.800890	0.786063	0.065782	-0.796682	2.252840	8.677572	1.231766
mu__7	1.440635	0.701671	0.056108	-0.038637	2.613982	14.950476	1.151567
mu__8	2.080768	0.591865	0.043292	0.962140	3.026773	38.514392	1.064755
mu__9	2.639729	0.477819	0.027276	1.715418	3.140600	198.288401	1.015101

A pairplot confirms the swapping: (can’t include another image… grr)

I’ve encountered this swapping before in finite-component models, and solved it with the ordered transform. However, when I try to use the same transform here, my chains become entirely divergent:

k = 10 # truncate the Dirichlet process at 10 components
with pm.Model() as m_ordered:
    # mixture weights
    alpha = 0.1
    beta = pm.Beta('beta', 1, alpha, shape=k)
    pi = pm.Deterministic('pi', stick_breaking(beta))
    
    # mixture parameters
    kappa = pm.Gamma('kappa', 5, 1, shape=k) 
    mu_testval = np.linspace(min(sample), max(sample), k)
    mu = pm.Uniform('mu', -np.pi, np.pi, shape=k, 
                    transform=tr.ordered, testval=mu_testval)
    
    # initialize VM components
    all_vms = []
    for vm in range(k):
        all_vms.append(pm.VonMises.dist(mu=mu[vm], 
                       kappa=kappa[vm]))
    
    obs = pm.Mixture('obs', w=pi, comp_dists=all_vms, 
                     observed=sample)
with m_ordered:
    trace_ordered = pm.sample(1000, tune=500, chains=4, 
                              init='adapt_diag',
                              nuts_kwargs=dict(target_accept=.85))
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, kappa, beta]
Sampling 4 chains: 100%|██████████| 6000/6000 [00:14<00:00, 413.94draws/s]
The chain contains only diverging samples. The model is probably misspecified.
The chain contains only diverging samples. The model is probably misspecified.
The chain contains only diverging samples. The model is probably misspecified.
The chain contains only diverging samples. The model is probably misspecified.
The acceptance probability does not match the target. It is 0.7267639544474495, but should be close to 0.85. Try to increase the number of tuning steps.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

The traceplot is (predictably) a mess.

Am I missing something regarding the correct use of transforms.ordered? Given the number of divergences, is label-switching even my main concern here? On the odd sampling run, switching doesn’t occur, but I still end up with many divergences. In those cases the samples still seem reasonable, and the posteriors for the component parameters (and component weights) let me cluster my stimuli adequately. Obviously, I’m leery of relying on these clusters, and I’d like to see much cleaner convergence.

Any advice would be greatly appreciated!

Attaching the data here, to clarify why I said the samples aren’t too bad:

The posterior weights indicate only 2 significant components, and the posterior means would be in the right place if they weren’t swapped.
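For completeness, the post-hoc workaround I've been considering instead of the transform: relabeling components within each draw by sorting on mu. A rough numpy sketch (`trace_mu` / `trace_pi` stand in for the stacked posterior samples; note that sorting raw mu values ignores the circular wraparound, so this is only a rough fix for circular data):

```python
import numpy as np

def relabel_by_mu(trace_mu, trace_pi):
    """Per-draw relabeling: sort components by mu within each sample.

    trace_mu, trace_pi: arrays of shape (n_samples, k).
    Returns relabeled copies so that mu is ascending in every draw.
    """
    order = np.argsort(trace_mu, axis=1)
    rows = np.arange(trace_mu.shape[0])[:, None]
    return trace_mu[rows, order], trace_pi[rows, order]

# toy check: two draws with the same two components in swapped positions
mu = np.array([[0.5, -1.0], [-1.0, 0.5]])
pi = np.array([[0.3, 0.7], [0.7, 0.3]])
mu_r, pi_r = relabel_by_mu(mu, pi)
# after relabeling, both draws list the mu=-1.0 (weight 0.7) component first
```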