Ah, yeah, that was condescending, sorry. I guess I meant “easy” in the sense that there’s not a big complicated package underneath, just a couple functions we can figure out.
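For anyone following along, here's a minimal stand-in for `build_psm` — the straightforward all-pairs loop (my reconstruction for reference; the real function may differ in details):

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fall back to plain Python if numba isn't installed
    njit = lambda f: f

@njit
def build_psm(posterior):
    # Posterior similarity matrix: psm[i, j] = fraction of draws in which
    # observations i and j were assigned the same cluster label.
    n_draws, n_obs = posterior.shape
    psm = np.zeros((n_obs, n_obs))
    for row in posterior:
        for i in range(n_obs):
            for j in range(n_obs):
                if row[i] == row[j]:
                    psm[i, j] += 1
    return psm / n_draws
```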
I tried to optimize the build_psm function a bit. Here's what I came up with:
```python
import numpy as np
from numba import njit

@njit
def build_psm2(posterior):
    n_draws, n_obs = posterior.shape
    # The diagonal is always 1, so start it at n_draws and normalize at the end.
    psm = np.eye(n_obs) * n_draws
    for row in posterior:
        for r_idx, i in enumerate(row):
            # Only walk the upper triangle; mirror each match into the lower half.
            for c_idx, j in enumerate(row[r_idx:], r_idx):
                if i == j and r_idx != c_idx:
                    psm[r_idx, c_idx] += 1
                    psm[c_idx, r_idx] += 1
    return psm / n_draws
```
Here’s another one that uses matrix multiplication instead of explicit loops over the rows and columns; instead, we loop over the classes.
```python
@njit
def build_psm3(posterior):
    classes = np.unique(posterior)
    n_draws, n_obs = posterior.shape
    psm = np.zeros((n_obs, n_obs))
    for c in classes:
        # One-hot indicator: 1 where a draw assigned class c to an observation.
        indicator_matrix = (posterior == c) * 1.0
        # The Gram matrix counts, for each pair, the draws that co-assigned class c.
        psm += indicator_matrix.T @ indicator_matrix
    return psm / n_draws
```
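To convince myself the Gram-matrix trick is right: for each class c, the one-hot indicator I_c satisfies (I_c.T @ I_c)[i, j] = number of draws in which both i and j got label c, so summing over classes counts the draws where i and j agree. A quick NumPy-only sanity check against a brute-force loop (toy data, no Numba):

```python
import numpy as np

# Tiny posterior: 2 draws (rows) x 3 observations (columns).
posterior = np.array([[0, 0, 1],
                      [0, 1, 1]])
n_draws, n_obs = posterior.shape

# Sum of per-class Gram matrices, as in build_psm3.
psm = np.zeros((n_obs, n_obs))
for c in np.unique(posterior):
    ind = (posterior == c).astype(float)
    psm += ind.T @ ind
psm /= n_draws

# Brute-force pairwise count for comparison.
brute = np.zeros((n_obs, n_obs))
for row in posterior:
    for i in range(n_obs):
        for j in range(n_obs):
            brute[i, j] += row[i] == row[j]
brute /= n_draws

print(np.allclose(psm, brute))
```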
Benchmarks on the 4x4 example matrix posterior:
```
%timeit build_psm(posterior)
>>932 ns ± 4.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit build_psm2(posterior)
>>1.1 µs ± 4.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit build_psm3(posterior)
>>3.41 µs ± 19.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
Benchmarks on the cls.draw2 dataset (500 draws x 400 observations):
```
%timeit build_psm(data.values)
>>26 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit build_psm2(data.values)
>>34.8 ms ± 740 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit build_psm3(data.values)
>>41.3 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Benchmarks on a “large” problem (50 draws x 10,000 observations):
```
%timeit build_psm(big_data)
>>4.71 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit build_psm2(big_data)
>>7.97 s ± 74.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit build_psm3(big_data)
>>5.98 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Surprisingly, the first version is always fastest, so my “optimizations” weren’t very optimizing. I don’t know why that is, given that build_psm2 does roughly half as many inner-loop iterations. Maybe Numba notices we’re looping over the same object twice and does that optimization itself, so my changes just get in the way; I’m not sure. I also tried splitting the posterior into rows and building the psm in parallel, but it wasn’t faster than plain Numba (plus there’s big overhead when the samples are large). Maybe the matrix multiplication method could be faster if all your samples are on a GPU? From the big-data test, it looks like it might scale better than the simple loops.
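One more variant I didn’t benchmark above, for completeness: pure NumPy broadcasting compares every pair of columns in every draw at once. It builds an (n_draws, n_obs, n_obs) boolean tensor — about 5 GB for the 50 x 10,000 case, so it’s a sketch rather than a recommendation, but this kind of dense elementwise comparison is also the shape of computation that maps naturally onto a GPU:

```python
import numpy as np

def build_psm_broadcast(posterior):
    # psm[i, j] = fraction of draws where observations i and j share a label.
    # Allocates an (n_draws, n_obs, n_obs) boolean array, so memory is the
    # limiting factor for large n_obs.
    return (posterior[:, :, None] == posterior[:, None, :]).mean(axis=0)

# Usage on a toy posterior (2 draws x 3 observations):
psm = build_psm_broadcast(np.array([[0, 0, 1],
                                    [0, 1, 1]]))
```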