Posterior distribution clustering


I am currently working on a very degenerate inference case in which multi-peaked posteriors are the norm. This is related to the scarcity of data (Cassini mission radar data) and the use of a relatively complex physical model as the mean of the likelihood. Of course, every peak in the posterior has a sound physical interpretation.

I want an automatic way to identify and characterize these ‘posterior clusters’ using a Bayesian scheme (since this is a partition problem). I have found the R package ‘mcclust’ that does just that, but it is in R.

Is there a way to write the clustering problem (including the loss function) in PyMC3?

Thank you!


My first thought would be to apply k-means to your posterior samples. Do you need some sort of downstream uncertainty quantification or other issue that would require using a Bayesian model for it?
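For the k-means route, here is a minimal sketch on stacked posterior draws. The bimodal samples are synthetic, just for illustration; with a real trace you would stack your parameter draws into an (n_draws, n_params) array, e.g. `np.column_stack([trace["a"], trace["b"]])`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a bimodal 2-D posterior: two well-separated modes.
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=[-3, 0], scale=0.5, size=(500, 2)),
    rng.normal(loc=[3, 0], scale=0.5, size=(500, 2)),
])

# Cluster the draws; each label says which posterior mode a draw belongs to.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(samples)
labels = km.labels_
```

The obvious catch is that you have to pick `n_clusters` yourself, whereas the mcclust-style approach below scores partitions against the posterior similarity matrix.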


I don’t know anything about this type of problem, but looking at the docs and code for mcclust, it should be extremely easy to write up yourself.

First, you need to construct a posterior similarity matrix, which (from the mcclust docs) is: “a matrix whose entry [i, j] contains the posterior probability that observation i and j are together in a cluster.”

This is just a bunch of loops. You might be able to optimize it somehow, but I just went the simple route and used numba. Recreating the example from the mcclust docs:

from numba import njit
import numpy as np

# Shape is n_draws x n_obs, and the numbers are class labels
posterior = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [1, 2, 2, 2],
                      [2, 2, 1, 1]])

@njit
def build_psm(posterior):
    n_draws, n_obs = posterior.shape
    psm = np.zeros((n_obs, n_obs))
    for row in posterior:
        for r_idx, i in enumerate(row):
            for c_idx, j in enumerate(row):
                if i == j:
                    psm[r_idx, c_idx] += 1
    psm /= n_draws
    return psm

This matches the output on the R sample problem, and it also matches on the cls.draw2 dataset that comes with the R package. Building the psm for the 500 x 500 case takes 48 ms, so you might want to think of a way to get that down from O(n^3).
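For reference, the loops can also be collapsed into a single NumPy broadcast. It does the same O(n_draws · n_obs²) comparisons but pushes them into C; the trade-off is a temporary boolean array of shape (n_draws, n_obs, n_obs), which may not fit in memory for very large problems:

```python
import numpy as np

def build_psm_broadcast(posterior):
    # same[d, i, j] is True when observations i and j share a label in draw d;
    # averaging over draws gives the co-clustering probability.
    same = posterior[:, :, None] == posterior[:, None, :]
    return same.mean(axis=0)

# The 4 x 4 example from above.
posterior = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [1, 2, 2, 2],
                      [2, 2, 1, 1]])
psm = build_psm_broadcast(posterior)
```

On the example, observations 0 and 1 share a label in 3 of 4 draws, so `psm[0, 1]` is 0.75, and the diagonal is all ones.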

Once you have this object, your work is basically done. Digging into the R code, it looks like they use this similarity matrix as a distance matrix in a hierarchical clustering algorithm. I used the sklearn implementation; the linkage setting is the default used in the R program.

from sklearn.cluster import AgglomerativeClustering

# linkage="average" matches the mcclust default ("avg"); on
# scikit-learn < 1.2 the metric= keyword is called affinity= instead
hcluster = AgglomerativeClustering(n_clusters=8, metric="precomputed",
                                   linkage="average")
predicted_classes = hcluster.fit_predict(1 - psm)

Based on the R code again, the Binder loss (which is what I assume you are talking about?) is just:

np.sum(np.abs(build_psm(predicted_classes[None]) - psm) / 2)

The loss that R gives on that cls.draw2 dataset is 3405.174, so close enough for me, given that I have no idea what the differences in the sklearn and R hierarchical algorithms are.

Glancing over the code, it doesn’t look like they do anything with that loss; it just runs the hierarchical clustering and shows you the loss. I could be missing something, though. You might be able to wrap the whole thing up in a function and use scipy.optimize.minimize to find class labels that minimize the loss function. No idea, though.
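One simple alternative to a general-purpose optimizer (this is my own sketch, not something mcclust does): sweep `n_clusters` over the hierarchical cuts and keep the partition with the smallest Binder loss. The helper names here are made up; `build_psm` is a NumPy-broadcast version of the function above:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def build_psm(posterior):
    # Fraction of draws in which each pair of observations is co-clustered.
    same = posterior[:, :, None] == posterior[:, None, :]
    return same.mean(axis=0)

def binder_loss(labels, psm):
    # Binder loss of a single partition (treated as a one-draw "posterior")
    # against the posterior similarity matrix.
    return np.sum(np.abs(build_psm(labels[None, :]) - psm)) / 2

def best_partition(psm, max_clusters=10):
    # Try each cut of the hierarchy and keep the lowest-loss partition.
    best = None
    for k in range(2, max_clusters + 1):
        hc = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                     linkage="average")
        labels = hc.fit_predict(1 - psm)
        loss = binder_loss(labels, psm)
        if best is None or loss < best[0]:
            best = (loss, labels)
    return best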


First, thank you very much for the detailed answer. Maybe it was extremely easy for you, but I got stuck in the construction of the posterior similarity matrix. I will study your answer in detail.

Thank you again!

Ah, yeah, that was condescending, sorry. I guess I meant “easy” in the sense that there’s not a big complicated package underneath, just a couple functions we can figure out.

I tried to optimize the build_psm function a bit; here’s what I came up with:

@njit
def build_psm2(posterior):
    n_draws, n_obs = posterior.shape
    psm = np.eye(n_obs) * n_draws
    for row in posterior:
        for r_idx, i in enumerate(row):
            for c_idx, j in enumerate(row[r_idx:], r_idx):
                if (i == j) & (r_idx != c_idx):
                    psm[r_idx, c_idx] += 1
                    psm[c_idx, r_idx] += 1
    return psm / n_draws

Here’s another one that uses matrix multiplication instead of explicitly looping over the rows and columns. Instead we have to loop over the classes.

def build_psm3(posterior):
    classes = np.unique(posterior)
    n_draws, n_obs = posterior.shape
    psm = np.zeros((n_obs, n_obs))
    for c in classes:
        indicator_matrix = (posterior == c) * 1.
        psm += (indicator_matrix.T @ indicator_matrix)
    return psm / n_draws

Benchmarks on the 4x4 example matrix posterior:

%timeit build_psm(posterior)
>>932 ns ± 4.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit build_psm2(posterior)
>>1.1 µs ± 4.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit build_psm3(posterior)
>>3.41 µs ± 19.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Benchmarks on the cls.draw2 dataset (500 draws x 400 observations):

%timeit build_psm(data.values)
>>26 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit build_psm2(data.values)
>>34.8 ms ± 740 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit build_psm3(data.values)
>>41.3 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On a “large” problem: 50 draws x 10,000 observations:

%timeit build_psm(big_data)
>>4.71 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit build_psm2(big_data)
>>7.97 s ± 74.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit build_psm3(big_data)
>>5.98 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Surprisingly, the first version is always fastest, so my “optimizations” weren’t very optimizing. I don’t know why, since build_psm2 does half as many loop iterations. Numba might notice we’re looping over the same object twice and do the optimizations itself, so the changes just get in the way. I’m not sure. I tried splitting the posterior into rows and building the psm in parallel, but it wasn’t faster than plain Numba (plus there’s big overhead when the samples are large). Maybe the matrix multiplication method could be faster if all your samples are on a GPU? From the big-data test, it looks like it might scale better than the simple loops.
