Posterior distribution clustering

I don’t know anything about this type of problem, but looking at the docs and code for mcclust, it should be extremely easy to write up yourself.

First, you need to construct a posterior similarity matrix, which (from the mcclust docs) is: “a matrix whose entry [i, j] contains the posterior probability that observation i and j are together in a cluster.”

This is just a bunch of loops. You could probably vectorize it, but I just wrote the loops as-is and let numba speed them up. Recreating the example from the mcclust docs:

from numba import njit
import numpy as np

# Shape is n_draws x n_obs, and the numbers are class labels
posterior = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [1, 2, 2, 2],
                      [2, 2, 1, 1]])

@njit
def build_psm(posterior):
    # psm[i, j] = fraction of posterior draws in which observations
    # i and j are assigned the same class label
    n_draws, n_obs = posterior.shape
    psm = np.zeros((n_obs, n_obs))
    for row in posterior:  # one clustering per draw
        for r_idx, i in enumerate(row):
            for c_idx, j in enumerate(row):
                if i == j:  # same label in this draw
                    psm[r_idx, c_idx] += 1

    psm /= n_draws  # counts -> co-clustering probabilities

    return psm
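
For the toy posterior above this gives (observations 1 and 2 share a label in 3 of the 4 draws, hence the 0.75 entry):

build_psm(posterior)
>>array([[1.  , 0.75, 0.  , 0.  ],
         [0.75, 1.  , 0.25, 0.25],
         [0.  , 0.25, 1.  , 1.  ],
         [0.  , 0.25, 1.  , 1.  ]])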

This matches the output on the R sample problem, and it also matches on the cls.draw2 dataset that comes with the R package. Building the 500 x 500 psm for that dataset takes about 48 ms, and the triple loop is O(n_draws * n_obs^2), so you might want to think about a faster construction if your problem is much bigger.
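
One option, if you can afford the intermediate n_draws x n_obs x n_obs boolean array, is to push the loops into NumPy broadcasting instead of numba. It is the same amount of work asymptotically, it just drops the Python-level loops (a sketch I haven't benchmarked; build_psm_broadcast is my own name):

def build_psm_broadcast(posterior):
    # Compare every pair of labels within each draw (boolean array of shape
    # n_draws x n_obs x n_obs), then average over the draws.
    return (posterior[:, :, None] == posterior[:, None, :]).mean(axis=0)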

Once you have this object your work is basically done. Digging into the R code, it looks like they feed 1 minus this similarity matrix into a hierarchical clustering algorithm as a distance matrix. I used the sklearn implementation; the linkage setting below is the default used in the R program.

from sklearn.cluster import AgglomerativeClustering

# note: newer scikit-learn versions call the `affinity` argument `metric`
hcluster = AgglomerativeClustering(n_clusters=8,
                                   linkage='average',
                                   affinity='precomputed')
# 1 - psm turns the similarity matrix into a distance matrix
predicted_classes = hcluster.fit_predict(1 - psm)

Based on the R code again, the Binder loss (which is what I assume you are talking about?) is just:

# build_psm on a single clustering gives the 0/1 co-clustering indicator matrix
np.sum(np.abs(build_psm(predicted_classes[None]) - psm) / 2)
>>3480.89

The loss R reports on the cls.draw2 dataset is 3405.174, which is close enough for me, given that I don't know exactly how the sklearn and R hierarchical clustering implementations differ.
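
If you end up computing this a lot, it is tidy to wrap the expression above in a small helper (binder_loss is just my own name for it):

def binder_loss(labels, psm):
    # build_psm on a single clustering yields the 0/1 co-clustering matrix,
    # so this sums |indicator - posterior probability| over all pairs.
    return np.sum(np.abs(build_psm(labels[None]) - psm)) / 2

binder_loss(predicted_classes, psm)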

Glancing over the R code, it doesn't look like they do anything further with that loss; it just runs the hierarchical clustering and reports the loss to you. I could be missing something, though. You might be able to wrap the whole thing up in a function and search for class labels that minimize the loss (scipy.optimize.minimize wants continuous parameters, so for discrete labels a simple search like the one sketched below may be easier). No idea though.
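
To make that last idea concrete, here is a rough sketch (my own, not taken from the R code): reuse the pieces above and try a range of cluster counts, keeping the cut with the smallest Binder loss. The max_clusters cap is arbitrary.

def min_binder_labels(psm, max_clusters=20):
    # Try each candidate number of clusters and keep the labelling
    # with the smallest Binder loss.
    best_labels, best_loss = None, np.inf
    for k in range(2, max_clusters + 1):
        hc = AgglomerativeClustering(n_clusters=k,
                                     linkage='average',
                                     affinity='precomputed')
        labels = hc.fit_predict(1 - psm)
        loss = binder_loss(labels, psm)
        if loss < best_loss:
            best_labels, best_loss = labels, loss
    return best_labels, best_loss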
