KroneckerNormal speed

I’m trying to use KroneckerNormal and finding that it slows down a lot as the dimension of one of the covariance matrices increases. A similar model runs quickly with MatrixNormal, but I want additive noise, so I am trying to convert it to KroneckerNormal (I realize the two have different runtimes, but I am still surprised at how slow KroneckerNormal is). I am not well-versed in Theano, but I wondered if anyone could comment on what might be slowing it down, and whether there is something I can do differently (or whether it is an issue inherent in the code).

Here is a notebook demonstrating the models with toy data; as the dimension of one of the covariance matrices grows, KroneckerNormal slows down quite dramatically: https://github.com/natalieklein/intro_to_gp/blob/master/test_kronprod.ipynb

Thanks for any thoughts or advice!

Your code looks good. The worst offenders appear to be the Scan and Eigh operations. Scan is essentially how Theano does loops, and it’s known to be pretty slow. The pymc3 source where it’s called by KroneckerNormal is here, in the kron_matrix_op function. It is used to compute things like the dot product (and also solves) between the Kronecker product of n matrices and some vector y:

$$(K_1 \otimes K_2 \otimes \cdots \otimes K_n)\, y$$

Scan is used here because Theano doesn’t know n in advance. This problem doesn’t arise for MatrixNormal because it is fixed at n = 2. It would be very good if we could get rid of the use of Scan; that would speed up your code by roughly 50%.
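For intuition, here is a rough NumPy sketch of the kind of matrix-vector product Scan is looping over. This is only an illustration of the algorithm (the name kron_mvprod and the details are mine, not the actual pymc3/Theano implementation): apply one factor at a time to a reshaped copy of y, cycling the axes with a transpose, so the full Kronecker product is never formed.

import numpy as np

def kron_mvprod(Ks, y):
    # (K_1 kron K_2 kron ... kron K_n) @ y without forming the full product:
    # apply one square factor at a time to a reshaped copy of y.
    x = np.asarray(y, dtype=float).copy()
    total = x.size
    for K in Ks:
        m = K.shape[0]
        X = x.reshape(m, total // m)   # rows indexed by the current factor
        Z = np.dot(K, X)
        x = Z.T.reshape(-1)            # move the finished axis to the back
    return x

# quick check against the dense Kronecker product
K1, K2 = np.random.randn(3, 3), np.random.randn(4, 4)
y = np.random.randn(12)
assert np.allclose(kron_mvprod([K1, K2], y), np.dot(np.kron(K1, K2), y))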

It appears that the Eigh decomposition is slower in general than the Cholesky decomposition. When running on the CPU, Theano calls scipy.linalg.eigh and scipy.linalg.cholesky under the hood, see here and here. For timings on my machine (4-core i5 processor using MKL), I get:

import numpy as np
import scipy as sp
import scipy.linalg

# make a symmetric positive-definite 1000x1000 matrix
A = np.random.randn(1000, 1000)
K = np.dot(A.T, A) + np.eye(1000) * 1e-6

%timeit sp.linalg.eigh(K)
167 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit sp.linalg.cholesky(K)
14.7 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On your machine, these timings ought to match up pretty well with the per-call times for these ops in your profiling. So Eigh is just slower than Cholesky.

Chapter 5 of Saatchi’s thesis is a good place to find detailed info about using the eigendecomposition for Kronecker-structured covariance with spherical noise.
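To make the connection concrete: the trick described there (and, as far as I can tell, why the noisy KroneckerNormal uses Eigh rather than Cholesky) is that eigendecomposing each factor $K_i = Q_i \Lambda_i Q_i^T$ gives, for two factors, $K_1 \otimes K_2 + \sigma^2 I = (Q_1 \otimes Q_2)(\Lambda_1 \otimes \Lambda_2 + \sigma^2 I)(Q_1 \otimes Q_2)^T$, which is cheap to invert. A rough NumPy sketch, reusing the illustrative kron_mvprod from above (again, this shows the idea, not the actual pymc3 code):

def kron_solve_with_noise(Ks, sigma2, y):
    # Solve (K_1 kron ... kron K_n + sigma2*I) x = y with one eigh per factor,
    # never forming the full covariance matrix.
    eigvals, Qs = zip(*[np.linalg.eigh(K) for K in Ks])
    # eigenvalues of the Kronecker product are all products of factor eigenvalues
    lam = eigvals[0]
    for w in eigvals[1:]:
        lam = np.kron(lam, w)
    alpha = kron_mvprod([Q.T for Q in Qs], y)  # rotate into the joint eigenbasis
    alpha = alpha / (lam + sigma2)             # divide by the noisy eigenvalues
    return kron_mvprod(list(Qs), alpha)        # rotate back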


Thanks! Your insight about why Scan is slow is helpful. Since my n is always 2, I may be able to modify the code for my use case… I’ll give it a try. I have read Saatchi’s thesis and was considering implementing it myself, but found that it was already implemented here :slight_smile: Since I’m already using pymc3 for the other models I’m comparing this to, it would be nice to keep using it.

A small update: I replaced the call to scan with:

res1 = kron_vector_op(m)
res = kron_vector_op(res1)

and indeed it speeds things up a lot. I wonder if changing from scan to an actual for loop could help? Or is it that you can’t know, in general, how many times kron_vector_op will need to be called?

Without scan, about 80% of the time to calculate the log likelihood goes to Eigh, and about 95% of the time to calculate the gradients goes to EighGrad, so I guess there isn’t much that can be done to speed it up further.

There might be a way to do it with a for loop, and that would definitely be preferable. I gotta think about it some more to be sure, though. One thing is that I’ve seen papers (I can’t remember which) say that these Kronecker models start to break down when the dimensionality is greater than ~6. One possibility is to just code in each case explicitly and disallow dimensions > 6; anything higher is probably really uncommon. I’m sure two dimensions is very common, maybe three, and then four starts to push it.
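For what it’s worth, here is roughly what an unrolled version could look like: since the list of covariance factors is an ordinary Python list when the distribution is constructed, a plain Python for loop gets unrolled into a fixed Theano graph with no Scan node. It’s the same reshape/transpose loop as the NumPy sketch earlier in the thread, just written with Theano ops; a sketch of the idea only, not tested against the actual pymc3 internals:

import theano.tensor as tt

def kron_dot_unrolled(Ks, y):
    # Apply (K_1 kron ... kron K_n) to the vector y. The Python-level loop
    # is unrolled at graph-construction time, so no Scan op is created.
    x = y
    for K in Ks:
        X = tt.reshape(x, (K.shape[0], -1))  # rows indexed by this factor
        Z = tt.dot(K, X)
        x = Z.T.flatten()                    # cycle the finished axis to the back
    return x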

Looks like your fix does the trick. Yeah, it’s probably not going to get much faster (unless you can run on the GPU…? which may or may not work). If Eigh is the bottleneck then that’s pretty great. Have you seen MarginalKron? Also this PR for non-Gaussian likelihoods here.

Edit: I saw your post on the MarginalKron PR, so yes, you’ve seen it. I’ll leave the link for people reading along.

Thanks again for your thoughts. I agree that it’s hard to see dimensions greater than 4 being widely applicable; certainly in the use case I am considering (emulation of computer simulations), I think two would be the most common.