**TL;DR:** I’m attempting to perform non-parametric clustering of circular data using a Dirichlet process mixture of von Mises distributions. I’m trying to fix label-switching using `transforms.ordered`

, but my chains contain only divergent samples.

**Details:**

The observed data are stimulus values (colour) from a cognitive psychology experiment. Each trial of the experiment used 6 stimuli sampled from a uniform distribution. For our analysis we would like to characterize possible clusters of stimuli that the observer might perceive. ~~Here’s an example stimulus set that could reasonably be clustered into 2-3 von Mises components:~~ (apparently, I can only attach one image… data in a reply below).

Here is the model that I’ve implemented based on examples from the community:

```
from scipy import stats
import matplotlib.pyplot as plt
import pymc3 as pm
from theano import tensor as tt
import pymc3.distributions.transforms as tr
```

```
def stick_breaking(beta):
portion_remaining = tt.concatenate([[1],
tt.extra_ops.cumprod(1 - beta)[:-1]])
return beta * portion_remaining
k = 10 # truncate the Dirichlet process at 10 components
with pm.Model() as m:
# mixture weights
alpha = 0.1
beta = pm.Beta('beta', 1, alpha, shape=k)
pi = pm.Deterministic('pi', stick_breaking(beta))
# mixture parameters
kappa = pm.Gamma('kappa', 5, 1, shape=k)
mu = pm.Uniform('mu', -np.pi, np.pi, shape=k)
# initialize VM components
all_vms = []
for vm in range(k):
all_vms.append(pm.VonMises.dist(mu=mu[vm],
kappa=kappa[vm]))
obs = pm.Mixture('obs', w=pi, comp_dists=all_vms,
observed=sample)
```

(The prior on `kappa`

is intended to bias components toward narrow clusters, and is based on previous experiments. I’ve found that the choice of prior here doesn’t seem to impact the issue below)

```
with m:
trace = pm.sample(1000, tune=500, chains=4,
init='adapt_diag',
nuts_kwargs=dict(target_accept=.85))
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, kappa, beta]
Sampling 4 chains: 100%|██████████| 6000/6000 [00:18<00:00, 317.08draws/s]
There were 755 divergences after tuning. Increase `target_accept` or reparameterize.
There were 784 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7670051481506523, but should be close to 0.85. Try to increase the number of tuning steps.
There were 718 divergences after tuning. Increase `target_accept` or reparameterize.
There were 790 divergences after tuning. Increase `target_accept` or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
```

That’s a lot of divergences! G-R also looks problematic.

Surprisingly, the samples don’t look too bad. The main glaring issue that I can see is the swapping between values for

`mu__0`

and `mu__1`

.
```
pm.summary(trace_ordered, ['mu'])
```

```
mean sd mc_error hpd_2.5 hpd_97.5 n_eff Rhat
mu__0 -2.152960 0.687383 0.064667 -3.138628 -0.805135 9.510335 1.241728
mu__1 -1.447989 0.734422 0.067128 -2.975999 0.182541 28.046633 1.100530
mu__2 -0.936361 0.672227 0.059381 -2.215930 0.748179 36.990873 1.141194
mu__3 -0.658837 0.644452 0.055988 -1.845731 0.836822 42.063842 1.105958
mu__4 0.094493 0.761047 0.064515 -1.225197 1.578090 10.037738 1.188514
mu__5 0.493356 0.734129 0.061158 -0.811234 1.986704 17.941033 1.136716
mu__6 0.800890 0.786063 0.065782 -0.796682 2.252840 8.677572 1.231766
mu__7 1.440635 0.701671 0.056108 -0.038637 2.613982 14.950476 1.151567
mu__8 2.080768 0.591865 0.043292 0.962140 3.026773 38.514392 1.064755
mu__9 2.639729 0.477819 0.027276 1.715418 3.140600 198.288401 1.015101
```

~~A pairplot confirms the swapping:~~ (can’t include another image… grr)

I’ve encountered this swapping before in finite-component models, and solved it with the `ordered`

transform. However, when I try to use the same transform here, my chains become entirely divergent:

```
k = 10 # truncate the Dirichlet process at 10 components
with pm.Model() as m_ordered:
# mixture weights
alpha = 0.1
beta = pm.Beta('beta', 1, alpha, shape=k)
pi = pm.Deterministic('pi', stick_breaking(beta))
# mixture parameters
kappa = pm.Gamma('kappa', 5, 1, shape=k)
mu_testval = np.linspace(min(sample), max(sample), k)
mu = pm.Uniform('mu', -np.pi, np.pi, shape=k,
transform=tr.ordered, testval=mu_testval)
# initialize VM components
all_vms = []
for vm in range(k):
all_vms.append(pm.VonMises.dist(mu=mu[vm],
kappa=kappa[vm]))
obs = pm.Mixture('obs', w=pi, comp_dists=all_vms,
observed=sample)
```

```
with m_ordered:
trace_ordered = pm.sample(1000, tune=500, chains=4,
init='adapt_diag',
nuts_kwargs=dict(target_accept=.85))
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, kappa, beta]
Sampling 4 chains: 100%|██████████| 6000/6000 [00:14<00:00, 413.94draws/s]
The chain contains only diverging samples. The model is probably misspecified.
The chain contains only diverging samples. The model is probably misspecified.
The chain contains only diverging samples. The model is probably misspecified.
The chain contains only diverging samples. The model is probably misspecified.
The acceptance probability does not match the target. It is 0.7267639544474495, but should be close to 0.85. Try to increase the number of tuning steps.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
```

The traceplot is (predictably) a mess.

Am I missing something wrt the correct use of `transform.ordered`

? Based on the number of divergences, is label-switching even my main concern here? On the odd sampling run, switching doesn’t occur but I still end up with many divergences. In those cases, the samples still seem reasonable, and the posteriors for component parameters (and component weights) allow me to adequately cluster my stimuli. Obviously, I’m leery of using these clusters, and I’d like to see much prettier convergence.

Any advice would be greatly appreciated!