Why would scipy/numpy wrapped in @as_op cause much faster sampling than using pytensor operations?

The gradients may be very costly to compute, but you shouldn’t compare nuts vs metropolis+slice based on runtime but on effective sample size. It doesn’t matter if Metropolis is 3 times as fast if it gives you 6x less effective samples.

About your questions, no need to wrap in tensor variables, that’s down automatically by PyMC already.