Matrix inverse running very slowly

I have a model that includes the calculation of a matrix inverse. The matrix is 8k * 8k. The estimated sampling time with NUTS is 432 hours for only 1000 iterations (plus 1000 tuning steps). I put a simplified model below (the true model just has more parameters, but the structure is the same). I would appreciate any ideas for speeding it up.

with pm.Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Parameters
    b0_e = pm.Normal('b0', 0, sigma=1)
    b11_e = pm.Normal('b11', 0, sigma=1)
    b12_e = pm.Normal('b12', 0, sigma=1)
    b2_e = pm.Normal('b2', 0, sigma=1)   # coefficient on G1, used in the inverse below (prior here is a placeholder)
    b41_e = pm.Normal('b41', 1, sigma=1)
    b42_e = pm.Normal('b42', 1, sigma=1)
    sigma2 = pm.ChiSquared('e2', nu=2)

    # Calculation of the inverse matrix
    # G1 is 8000 * 8000
    bg_e = np.eye(N) - b2_e * G1
    bg_e_t = np.eye(N) - b2_e * G1.transpose()
    i_bg_e = pm.math.matrix_inverse(bg_e)
    i_bg_e_t = pm.math.matrix_inverse(bg_e_t)

    # Linear combinations
    bx_all = np.ones(N) * b0_e + b11_e * X[:, 0] + b12_e * X[:, 1] + b41_e * GX1[:, 0] + b42_e * GX1[:, 1]

    # Matrix multiplication
    mu_y = pm.math.matrix_dot(i_bg_e, bx_all)
    sigma_y = pm.math.extract_diag(pm.math.matrix_dot(i_bg_e, i_bg_e_t)) * sigma2

    likelihood = pm.Normal('y', mu=mu_y, sigma=sigma_y, observed=Y)

    trace = pm.sample(1000, return_inferencedata=True, cores=2)

Matrix inversion is an ~O(n^3) operation; for large matrices it’s always going to be a bottleneck. I quickly tested np.linalg.inv on an (8000, 8000) square matrix, and it took about 5 seconds on my machine. If you need to invert twice on every single logp evaluation (plus gradients!), it’s a hard problem.

My first question is whether we can eliminate the second inversion. In your code, bg_e.T == bg_e_t, and (A^T)^{-1} = (A^{-1})^T, so you can simplify a bit:

  bg_e = np.eye(N) - b2_e * G1 
  i_bg_e = pm.math.matrix_inverse(bg_e)
  i_bg_e_t = i_bg_e.T

This would eliminate one inverse, which should give you a significant speedup. This depends on b2_e being scalar. If in your true model it’s a vector, you need to think more about whether there is still an equivalence. But eliminating this second matrix inverse is where I’d start.
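
If you want to convince yourself of the transpose identity before changing the model, here is a quick numpy sanity check (a random matrix stands in for G1 and a fixed scalar for b2_e):

import numpy as np

rng = np.random.default_rng(0)
N = 50                        # small stand-in for the real 8000
G1 = rng.normal(size=(N, N))
b2 = 0.3                      # fixed scalar standing in for b2_e

bg = np.eye(N) - b2 * G1
bg_t = np.eye(N) - b2 * G1.T

# bg_t is exactly bg transposed, so its inverse is the transpose of bg's inverse
assert np.allclose(bg_t, bg.T)
assert np.allclose(np.linalg.inv(bg_t), np.linalg.inv(bg).T)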

Otherwise, there might be structure in G you can exploit. For example, if it is positive definite, you could use at.linalg.solve(G, EYE, assume_a='pos', check_finite=False) to invert it using a solver optimized for positive definite matrices.
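
For reference, this is what the "solve against an identity right-hand side" idea looks like in plain scipy, on a synthetic positive definite matrix; it's just a sketch, and whether it beats a plain inverse in your actual graph is something you would have to time:

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
n = 2000
A = rng.normal(size=(n, n))
A = A @ A.T + n * np.eye(n)   # construct a well-conditioned positive definite test matrix

EYE = np.eye(n)
# solve A X = I with a solver specialized for positive definite matrices
A_inv = linalg.solve(A, EYE, assume_a='pos', check_finite=False)
assert np.allclose(A_inv, np.linalg.inv(A))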

Thank you! The estimated running time has dropped to about 130 hours after applying your first suggestion.

The matrix is not positive definite, but it is block diagonal. I tried implementing a function that inverts each block and then reassembles the result, but it is even slower. (Each block is about 500 * 500.)

I wonder if there is a more efficient way to calculate the inverse of a block diagonal matrix.

def block_mat_inv(old_G, month_size):
    # old_G is the matrix to be inverted; month_size holds the width of each block

    # invert the first block
    row_idx = 0
    row_range = np.arange(row_idx, row_idx + month_size[0])
    bg_e_small = old_G[row_range, :][:, row_range]
    block1 = pm.math.matrix_inverse(bg_e_small)

    row_idx = row_idx + month_size[0]

    # invert each remaining block and grow the block-diagonal result one block at a time
    for i in range(len(month_size) - 1):
        row_range = np.arange(row_idx, row_idx + month_size[i + 1])
        bg_e_small = old_G[row_range, :][:, row_range]
        i_bg_e_small = pm.math.matrix_inverse(bg_e_small)

        block1 = pm.math.block_diagonal((block1, i_bg_e_small))

        row_idx = row_idx + month_size[i + 1]

    return block1

If you have a block diagonal matrix it will be much faster to invert the blocks and then assemble the result. I think the slower speed you saw in your test is just related to how you implemented it. I think the easiest way will be to store the sub-matrices in a list, then map the inverse over the list. Here’s the idea in numpy:

import numpy as np
from scipy import linalg

As = np.random.normal(size=(16, 500,500))
A = linalg.block_diag(*As) # 8000 x 8000

#baseline test
%timeit np.linalg.inv(A)
>> 4.79 s ± 81.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# blockwise inverse
%timeit linalg.block_diag(*list(map(np.linalg.inv, As)))
>> 51.4 ms ± 662 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So you get something like a 100x speedup. For PyMC, you can convert each sub-block into an aesara tensor, then do pretty much the same thing:

import aesara.tensor as at
import aesara
import pymc as pm

As = np.random.normal(size=(16, 500,500))
As = list(map(at.as_tensor_variable, As)) # not sure if this is needed.

#baseline
%timeit at.linalg.inv(pm.math.block_diagonal(As)).eval()
>> 6.23 s ± 50.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pm.math.block_diagonal(list(map(at.linalg.inv, As))).eval()
>> 1.36 s ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here I got about a 6x speedup. The speeds are slower than numpy because I’m calling .eval() over and over, which builds and compiles a fresh graph on every call; compiling the function once and calling it repeatedly would have much less overhead.
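
If you wanted a fairer comparison, you could compile the graph once with aesara.function and time the compiled function rather than .eval(). A rough sketch (not something I benchmarked here) would be:

import numpy as np
import aesara
import aesara.tensor as at
import pymc as pm

As_np = np.random.normal(size=(16, 500, 500))

# one symbolic matrix input per block
blocks = [at.matrix(f'A{i}') for i in range(len(As_np))]
block_inv = pm.math.block_diagonal(list(map(at.linalg.inv, blocks)))

# compilation happens once here; repeated calls to f only pay the runtime cost
f = aesara.function(blocks, block_inv)
result = f(*As_np)   # e.g. %timeit f(*As_np) to measure only the computation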

It might be possible to further optimize this, but I’m not sure. It’s also possible that even though we see a speed increase on the computation of the inverse, the computation of gradients will still be slower. Obviously I’m not testing that here, so you will need to keep that in mind.

Really the fastest speedup would come if you could invert your matrix once and re-use it every time. This would be possible (again, assuming b2_e is always a scalar), because (aX)^{-1} = \frac{1}{a}X^{-1}. The presence of the identity matrix messes things up though, since (I - aX)^{-1} can’t be re-written in terms of X^{-1} alone (or maybe it can; I didn’t work on the problem for very long!)
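
The scalar identity is easy to verify numerically if you want to convince yourself:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 10))
a = 0.7

# (aX)^{-1} equals (1/a) X^{-1}, so a pure rescaling would let you reuse one precomputed inverse
assert np.allclose(np.linalg.inv(a * X), np.linalg.inv(X) / a)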

Thank you so much! That sped it up by a further 40% or so. (Still a lot of time, though.)

Instead of converting to aesara tensors, my implementation is as below. In my true model I have two block matrices (G1 and G2) and two parameters; I guess this is why it does not speed up as much as your implementation.

bg_e = []
for i in range(len(month_size)):
    # build each diagonal block separately: I - b2*G1_block - b3*G2_block
    new_G = np.eye(month_size[i]) - b2_e * new_G1[i] - b3_e * new_G2[i]
    bg_e.append(new_G)

# invert each block, then assemble the block-diagonal inverse
i_bg_e = pm.math.block_diagonal(list(map(pm.math.matrix_inverse, bg_e)))

And I agree with you, the best way is to find an analytical solution for (I - aX)^{-1}. In my case, X = D*M, where D is a positive diagonal matrix and M is a positive definite matrix. I searched but did not find anything that solves this. Do you happen to know of any simplification for this form?

My point isn’t that you want an analytic solution to (I - aX)^{-1} (in some sense this is the analytic solution to the problem). Instead, you want to re-write the problem so that there are no random variables inside the inverse operation. That would let you define a variable G1_inv as the inverse of G1 outside the model, sample b2_e, and then compute something like (1 / b2_e) * G1_inv inside the model, which is a cheap elementwise operation.

For example, if you didn’t subtract b2_e * G1 from the identity matrix, this would be possible. I assume this equation is derived from somewhere, so you can’t just change it. Nevertheless, it might also be possible after staring at the problem for a bit and doing some algebra; I’m not sure, you’ll have to push a pen. I don’t think a decomposition of X is what you want, though.
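
To make that concrete, here is a toy sketch of the structure I mean, for a hypothetical model where the term is b2_e * G1 alone rather than I - b2_e * G1 (all names, priors, and data here are placeholders, not your actual model):

import numpy as np
import pymc as pm

N = 100
rng = np.random.default_rng(0)
G1 = rng.normal(size=(N, N))
X0 = rng.normal(size=N)
Y = rng.normal(size=N)

# invert once, outside the model, with ordinary numpy
G1_inv = np.linalg.inv(G1)

with pm.Model() as toy_model:
    b2_e = pm.Normal('b2', 1, sigma=0.1)   # placeholder prior, kept away from zero
    # (b2_e * G1)^{-1} @ X0 = (1 / b2_e) * (G1_inv @ X0), so no symbolic inverse is needed
    mu_y = (1 / b2_e) * pm.math.matrix_dot(G1_inv, X0)
    pm.Normal('y', mu=mu_y, sigma=1, observed=Y)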

Otherwise, you could try using at.linalg.solve instead of pm.math.matrix_inverse. You can set check_finite=False, which I’ve found offers a modest speedup in my testing.
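
A minimal sketch of that swap on a single stand-in block (whether it actually helps on your full graph, gradients included, is something you would need to profile):

import numpy as np
import aesara.tensor as at

block = at.as_tensor_variable(np.random.normal(size=(500, 500)))

# solving block @ X = I yields the inverse without an explicit matrix_inverse op
block_inv = at.linalg.solve(block, at.eye(500), check_finite=False)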