I don’t know anything about this type of problem, but looking at the docs and code for mcclust, it should be extremely easy to write up yourself.
First, you need to construct a posterior similarity matrix, which (from the mcclust doc) is: “a matrix whose entry [i, j] contains the posterior probability that observation i and j are together in a cluster.”
This is just a bunch of loops. You might be able to optimize it somehow, but I just went the 5-head way and used numba. Recreating the example from the mcclust docs:
from numba import njit
import numpy as np

# Shape is n_draws x n_obs, and the numbers are class labels
posterior = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [1, 2, 2, 2],
                      [2, 2, 1, 1]])

@njit
def build_psm(posterior):
    # psm[i, j] = fraction of draws in which observations i and j
    # share a cluster label
    n_draws, n_obs = posterior.shape
    psm = np.zeros((n_obs, n_obs))
    for row in posterior:
        for r_idx, i in enumerate(row):
            for c_idx, j in enumerate(row):
                if i == j:
                    psm[r_idx, c_idx] += 1
    psm /= n_draws
    return psm
This matches the output on the R sample problem, and it also matches on the cls.draw2 dataset that comes with the R package. Building the 500 x 500 psm takes 48ms, so for bigger problems you might want a way to speed up that triple loop (it's O(n_draws · n_obs²)); one vectorized alternative is sketched below.
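If the loops become a bottleneck, a minimal sketch of a broadcast-based version that builds the same matrix without numba (my own alternative, not from mcclust; it trades memory for speed, since it materializes an n_draws × n_obs × n_obs boolean array):

```python
def build_psm_vectorized(posterior):
    # (n_draws, n_obs, 1) == (n_draws, 1, n_obs) -> (n_draws, n_obs, n_obs)
    # boolean array marking, per draw, which pairs share a label
    same_cluster = posterior[:, :, None] == posterior[:, None, :]
    # Averaging over draws gives the co-clustering probabilities
    return same_cluster.mean(axis=0)
```

`np.allclose(build_psm(posterior), build_psm_vectorized(posterior))` should hold on the toy example above.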
Once you have this object your work is basically done. Digging into the R code, it looks like they turn this similarity matrix into a distance matrix (1 - psm) and feed it to a hierarchical clustering algorithm. I used the sklearn implementation. The 'average' linkage setting is the default used in the R program.
from sklearn.cluster import AgglomerativeClustering

# Note: on scikit-learn >= 1.2, `affinity` is deprecated in favor of `metric`
hcluster = AgglomerativeClustering(n_clusters=8,
                                   linkage='average',
                                   affinity='precomputed')
# The psm holds similarities, so 1 - psm is the precomputed distance matrix
predicted_classes = hcluster.fit_predict(1 - psm)
Based on the R code again, the Binder loss (which is what I assume you are talking about?) is just:

# predicted_classes[None] turns the labels into a one-draw "posterior",
# so build_psm returns the 0/1 co-clustering matrix; the division by 2
# counts each symmetric pair once
np.sum(np.abs(build_psm(predicted_classes[None]) - psm) / 2)
>> 3480.89
The loss that R gives on the cls.draw2 dataset is 3405.174, so close enough for me, given that I have no idea what the differences between the sklearn and R hierarchical clustering implementations are.
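To make experimenting easier, here's the same expression wrapped into a function (my own helper, not part of mcclust):

```python
def binder_loss(labels, psm):
    # Binder loss of a hard clustering against a posterior similarity
    # matrix: labels[None] is a one-draw "posterior", so build_psm gives
    # the 0/1 co-clustering indicator for that labelling
    return np.sum(np.abs(build_psm(labels[None]) - psm)) / 2
```

`binder_loss(predicted_classes, psm)` reproduces the number above.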
Glancing over the code, it doesn't look like they do anything further with that loss; it just runs the hierarchical clustering and reports the loss. I could be missing something, though. Note that scipy.optimize.minimize won't work directly here, since class labels are discrete rather than continuous; a simpler option is to wrap the whole thing in a function and search over n_clusters for the labelling that minimizes the loss. No idea whether that's what you need, though.
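For what it's worth, a minimal sketch of that discrete search, using the hypothetical `binder_loss` helper from above (this is my own idea of how to do it, not something from the R package):

```python
def min_binder_labels(psm, max_clusters=20):
    # Cut the average-linkage tree at every k from 2..max_clusters and
    # keep the labelling with the smallest Binder loss
    best_labels, best_loss = None, np.inf
    for k in range(2, max_clusters + 1):
        model = AgglomerativeClustering(n_clusters=k,
                                        linkage='average',
                                        affinity='precomputed')
        labels = model.fit_predict(1 - psm)
        loss = binder_loss(labels, psm)
        if loss < best_loss:
            best_labels, best_loss = labels, loss
    return best_labels, best_loss
```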