My gut reaction is that it typically shouldn’t be possible to beat the numerical implementations in BLAS/LAPACK. I benchmarked your scan-based dot against a normal vectorized dot and didn’t find any speedup from using the scan:
```python
import numpy as np
import pytensor
import pytensor.tensor as pt
from pytensor.graph.replace import vectorize_graph
from scipy import linalg

rng = np.random.default_rng()

L = pt.dmatrix('L')
x = pt.dvector('x')
L_batched = pt.tensor('L_batched', shape=(None, None, None))
x_batched = pt.dmatrix('x_batched')

normal_dot = L @ x

fn_0 = pytensor.function(
    [L_batched, x_batched],
    vectorize_graph(normal_dot, {L: L_batched, x: x_batched}),
)
# get_f_multi_pt is your scan-based implementation
fn_1 = pytensor.function(
    [L_batched, x_batched],
    get_f_multi_pt(L_batched, x_batched),
)

n_id, n_x = int(1e6), 5
A_val = rng.normal(size=(n_x, n_x))
A_val = A_val @ A_val.T
L_val = linalg.cholesky(A_val, lower=True)
x_val = rng.normal(size=(n_x,))
L_batched_val = np.stack([L_val for _ in range(n_id)], axis=0)
x_batched_val = np.stack([x_val for _ in range(n_id)], axis=0)

%timeit fn_0(L_batched_val, x_batched_val)
# 19.9 ms ± 366 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit fn_1(L_batched_val, x_batched_val)
# 27.9 ms ± 589 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
My guess is that, given your setup with a huge number of very small problems (n_id >> n_x), you are either facing a memory bottleneck in the “normal” dot case, or you are paying overhead costs over and over going back and forth between Python and the compiled subroutines.
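To illustrate the overhead point outside of PyTensor, here's a rough pure-NumPy sketch (the sizes are made up): one fused batched matmul makes a single trip into compiled code, while a Python-level loop over the same tiny matvecs does the identical flops but pays interpreter and dispatch overhead `n_id` times:

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
n_id, n_x = 10_000, 5
Ls = rng.normal(size=(n_id, n_x, n_x))
xs = rng.normal(size=(n_id, n_x))

# One fused call into compiled code for the whole batch
batched = lambda: np.matmul(Ls, xs[..., None])[..., 0]
# A Python-level loop of tiny matvecs: same flops, n_id dispatches
looped = lambda: np.stack([A @ v for A, v in zip(Ls, xs)])

assert np.allclose(batched(), looped())
print(f"batched: {timeit(batched, number=10):.4f}s, "
      f"looped:  {timeit(looped, number=10):.4f}s")
```

On my understanding, the batched version should win comfortably at these shapes precisely because the per-call overhead, not the arithmetic, dominates.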
I also tested using trmv, but I think that is always going to be slower in your case, since you’re dealing with lots of small matrices. If I were you, the next thing I’d try is forming a single huge matrix A = block_diag(Ls). That matrix would be far too large to actually create densely, but if you think about it, it will be banded lower triangular, with L.shape[0] - 1 subdiagonals (assuming L is lower). So you could store it in banded form and try using dtbmv, roughly dtbmv(A, z_batched.ravel()).reshape(z_batched.shape). That might let you avoid all the back-and-forth that comes from your problem setup, while still exploiting the structure of L.
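To make the banded idea concrete, here's a NumPy/SciPy sketch. The names follow your notation (`z_batched`), and for simplicity I'm assuming every batch member shares the same factor L, as in the benchmark above — with distinct factors you'd fill each block's columns from its own L instead of tiling:

```python
import numpy as np
from scipy import linalg
from scipy.linalg import blas

rng = np.random.default_rng(0)
n_id, n_x = 1000, 5

# One small lower-triangular factor, shared across the batch
A = rng.normal(size=(n_x, n_x))
L = linalg.cholesky(A @ A.T + n_x * np.eye(n_x), lower=True)
z_batched = rng.normal(size=(n_id, n_x))

# Banded storage of block_diag([L] * n_id): bandwidth k = n_x - 1.
# BLAS lower band storage puts the d-th subdiagonal in row d:
# AB[d, j] = A_full[j + d, j]. Within each block that subdiagonal is
# np.diag(L, -d); entries crossing a block boundary are zero.
k = n_x - 1
AB = np.zeros((k + 1, n_id * n_x))
for d in range(k + 1):
    band = np.zeros(n_x)
    band[: n_x - d] = np.diag(L, -d)
    AB[d] = np.tile(band, n_id)

# A single BLAS call on the whole flattened batch
y = blas.dtbmv(k, AB, z_batched.ravel(), lower=1).reshape(n_id, n_x)

# Same result as applying L to every batch member
assert np.allclose(y, z_batched @ L.T)
```

The banded storage is only `(n_x, n_id * n_x)`, so it stays small even when the dense block-diagonal matrix would be hopeless, and the whole batch goes through one compiled routine.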
I’m not familiar with the Bareiss algorithm, but my quick search suggests it’s aimed at numerical precision (it’s a fraction-free elimination scheme), not necessarily at speed. Have you specifically benchmarked it against pt.linalg.cholesky?