I want to apply bayesian statistics in the context of clustering and decided to use normal mixture models for this purpose. The context that I want to use this needs a bit more generality than regular gaussian mixture models available in other packages (see below), hence why I undertook this task.

I put together a model that performs very similar to sklearn’s mixture.GaussianMixture when one uses regular clustering centers as initial values and centers of the normal distribution priors (still quite vague prior around these though, that is large SDs, see below) and uses find_MAP().

On the otherhand sample returns completely weird results unless you very specifically guide the priors. I will show the model before I write more:

```
def bayesian_clustering(data, nclusters_fit, conc=1, mu_sigma=10, alpha=2,
beta=1, est_centers=None, sample=False):
'''
est_centers if supplied must be of shape nclusters_fit, ndims
where ndims = data.shape[1]. It is purpose to supply precalculated
cluster centers (such as from k-means etc) as guidance to the system.
By increasing mu_sigma or decreasing it you can determine how soft the
guidance will be (smaller values are more informative and more guidance).
if sample=True, then sampling will be used for finding parameters if
not then MAP
'''
if est_centers is None:
init = [np.linspace(-np.max(np.abs(data)), np.max(np.abs(data)),
nclusters_fit), None]
center_mus = [init[0], np.tile(data.mean(axis=0)[1:][:,None],
(nclusters_fit,1))]
else:
init = [est_centers[:,0], est_centers[:,1:]]
center_mus = [est_centers[:,0], est_centers[:,1:]]
ndims = data.shape[1]
with pm.Model() as model:
#priors
μ0 = pm.Normal("μ0",
mu=center_mus[0],
sigma=mu_sigma,
shape=(nclusters_fit,),
transform=pm.distributions.transforms.univariate_ordered,
initval=init[0])
μ1 = pm.Normal("μ1",
mu=center_mus[1],
sigma=mu_sigma,
shape=(nclusters_fit, ndims-1),
initval=init[1])
σ = pm.InverseGamma("σ", alpha=alpha, beta=beta)
weights = pm.Dirichlet("w", conc*np.ones(nclusters_fit))
#transformed priors
μ = pm.Deterministic("μ", ptt.concatenate([μ0[:,None], μ1], axis=1))
components = [pm.Normal.dist(μ[i,:], σ) for i in range(nclusters_fit)]
#likelihood
pm.Mixture('like', w=weights, comp_dists=components, observed=data)
if sample:
trace = pm.sample(draws=4000, chains=6, tune=2000,
target_accept=0.95)
else:
MAP = pm.find_MAP()
if sample:
return trace, model
return MAP, model
```

Some notes:

1- Note that I have used univariate normal, this is because the context I want to use this in later requires Censored distributions which is not available for MvNormal.

2- I have imposed an ordering constraint on the first component of the centers to remedy the labelling problem a bit, but I am not convinced that this would be enough. Supplying initial values hopefully does remedy a bit more too but feel like I need to come up with something else. After all each component evaluates independently from each other so other components probably won’t care about the ordering of the first component?

One of the things I have noticed by looking at the posteriors of components of the cluster centers is they are highly multi-modal and that is something which is, I think, expected for high dimensional mixture models. And in fact experiments with number of clusters=2 do seem to suggest that statistics for cluster centers are off because of “label switching”.

On the other hand, in almost all the tests I made, find_MAP() finds the true cluster centers and labels very precisely (when initial values are supplied), which I guess is expected, it converges to nearest minima. When it does not, the uncertainty reflected is in the “right way” (i.e when cluster centers for simulated data are too close or noise is too large etc the points in these clusters are confused). This only works well with sampling if I supply the cluster centers as initial values and keep the mu_sigma low such as mu_sigma=1. Anything above 5 starts creating problems.

The reason why I want to be also able to use sample reliably is down the line I am sampling likelihood of each point belonging to a cluster with something like this:

```
def sample_cluster_likelihoods(model, nclusters_fit, data, trace):
'''
sampling the cluster likelihoods from a given model and trace
'''
with model:
μ = model.μ
σ = model.σ
w = model.w
components = [pm.Normal.dist(μ[i,:], σ) for i in range(nclusters_fit)]
log_p =\
ptt.stack([pm.logp(components[i],data).sum(axis=1)
for i in range(nclusters_fit)])
p = pm.math.exp(log_p)
normalization = (w[:,None]*p).sum(axis=0)
pm.Deterministic("cluster_likelihoods",
w[:,None]/normalization[None,:]*p)
pps = pm.sample_posterior_predictive(trace,
var_names=["cluster_likelihoods"])
return pps
```

and would like to get HDI interval for these if I can (I can do the point estimate version when I have MAP).

So my questions would be

1- Is it safe to be using MAP in a model like this (as the simulated data results suggest)? The advantage of this over existing ones is I can modify the likelihood and also get probability per cluster for each data.

2- Is it ok to get the HDIs by doing the sampling with a relatively informative prior around a local minima found by say normal clustering or dare I say from the MAP?

3- Any suggestions on how to improve this model for better sampling and better remedying the label switching problem?

The full code include data generation functions etc and quite is length so I dumped it here:

note: I have also just come across this topic:

which seems to pretty much speak about what I am having difficulty with and it seems like “relabelling” the chains by hand seems like the most reasonable approach. I have however witnessed label switching even within a chain so this might need to be done on a per sample basis which could be quite slow. I guess one possibility is to add a “confusion cost matrix” that depends on an input baseline set of labels which maybe comes from MAP? Again the point would be to get HDI estimates (with a as soft guidance as possible) and not necessarily finding better centres than MAP.