I don’t know anything about this type of problem, but looking at the docs and code for
mcclust, it should be extremely easy to write up yourself.
First, you need to construct a posterior similarity matrix, which (from the mcclust docs) is "a matrix whose entry [i, j] contains the posterior probability that observation i and j are together in a cluster."
This is just a bunch of loops. You could probably vectorize it somehow, but I just brute-forced it and used numba to make the loops fast. Recreating the example from the mcclust docs:
from numba import njit
import numpy as np

@njit
def build_psm(posterior):
    # posterior has shape n_draws x n_obs; the entries are class labels
    n_draws, n_obs = posterior.shape
    psm = np.zeros((n_obs, n_obs))
    for row in posterior:
        for r_idx, i in enumerate(row):
            for c_idx, j in enumerate(row):
                if i == j:
                    # count the draws in which these two obs share a label
                    psm[r_idx, c_idx] += 1
    return psm / n_draws

posterior = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [1, 2, 2, 2],
                      [2, 2, 1, 1]])
psm = build_psm(posterior)
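For the toy example this gives:

print(psm)
# [[1.   0.75 0.   0.  ]
#  [0.75 1.   0.25 0.25]
#  [0.   0.25 1.   1.  ]
#  [0.   0.25 1.   1.  ]]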
This matches the output on the R sample problem, and it also matches on the cls.draw2 dataset that comes with the R package. Building the 500 x 500 psm takes about 48 ms; the triple loop is O(n_draws * n_obs^2), roughly O(n^3), so you might want to bring that down for larger problems.
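One loop-free alternative is a numpy broadcast (a sketch, equivalent to build_psm above, at the cost of materializing an n_draws x n_obs x n_obs boolean array):

# Compare every pair of labels in every draw at once, then average the
# 0/1 co-clustering indicators over the draws
def build_psm_vectorized(posterior):
    return (posterior[:, :, None] == posterior[:, None, :]).mean(axis=0)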
Once you have this object your work is basically done. Digging into the R code, it looks like they use this similarity matrix as a distance matrix (1 - psm) in a hierarchical clustering algorithm. I used the sklearn implementation; the linkage setting below is, as far as I can tell, the default used in the R program.
from sklearn.cluster import AgglomerativeClustering

# 'precomputed' since we pass the 1 - psm distance matrix directly;
# older sklearn versions call this argument 'affinity'
hcluster = AgglomerativeClustering(n_clusters=8, metric='precomputed',
                                   linkage='average')
predicted_classes = hcluster.fit_predict(1 - psm)
Based on the R code again, the Binder loss (which is what I assume you are talking about?) is just:
np.sum(np.abs(build_psm(predicted_classes[None]) - psm) / 2)
(predicted_classes[None] turns the flat label vector into a 1 x n_obs "posterior" with a single draw, so build_psm returns the 0/1 co-clustering matrix of the point estimate; the division by 2 stops each unordered pair from being counted twice.)
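As a sanity check on the toy example from above (values worked out by hand):

# [1, 1, 2, 2] as a point estimate differs from the toy psm by 0.25
# on the pairs (0,1), (1,2) and (1,3), so the Binder loss is 0.75
toy_estimate = np.array([1, 1, 2, 2])
print(np.sum(np.abs(build_psm(toy_estimate[None]) - psm) / 2))  # 0.75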
The loss that R gives on the cls.draw2 dataset is 3405.174, which is close enough for me, given that I have no idea what the differences between the sklearn and R hierarchical clustering implementations are.
Glancing over the code, it doesn't look like they do anything further with that loss; it just runs the hierarchical clustering and shows you the loss. I could be missing something, though. You might be able to wrap the whole thing up in a function and search for class labels that minimize the loss. scipy.optimize.minimize works on continuous parameters, so a discrete search (e.g. over the number of clusters, as sketched below) seems more natural, but no idea.
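Something like this, maybe (a minimal sketch; binder_loss, best_binder, and the max_k sweep are my own invention, not anything from mcclust):

def binder_loss(labels, psm):
    # Binder loss of a point-estimate clustering against the psm
    return np.sum(np.abs(build_psm(labels[None]) - psm) / 2)

def best_binder(psm, max_k=10):
    # Cut the average-linkage tree at each k and keep the lowest-loss cut
    best = None
    for k in range(1, max_k + 1):
        hc = AgglomerativeClustering(n_clusters=k, metric='precomputed',
                                     linkage='average')
        labels = hc.fit_predict(1 - psm)
        loss = binder_loss(labels, psm)
        if best is None or loss < best[0]:
            best = (loss, labels)
    return best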