OriolAbril's solution worked; however, the combination of a finite-difference gradient and the NUTS/HMC sampler is, in my case, still ~5x slower than the Metropolis sampler (which needs no gradient). But the endeavor was totally worth it and I learned a lot. Thanks for your quick help!
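For context, the gradient I plugged in is essentially a central finite-difference approximation. A minimal numpy sketch of the idea (not the actual Op wrapper; `logp`, `theta`, and `eps` are placeholder names):

```python
import numpy as np

def fd_gradient(logp, theta, eps=1e-6):
    """Central finite-difference approximation of grad(logp) at theta.

    Costs 2 * len(theta) extra likelihood evaluations per call, which is
    why NUTS with this gradient can end up slower than gradient-free
    Metropolis despite better mixing per sample.
    """
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        # symmetric difference: (f(x + h) - f(x - h)) / (2h)
        grad[i] = (logp(theta + step) - logp(theta - step)) / (2 * eps)
    return grad
```

The 2n likelihood evaluations per gradient call are the main reason the speedup from better NUTS proposals can be eaten up when the likelihood itself is expensive.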
A little side note: I found the warning
Initializing NUTS failed. Falling back to elementwise auto-assignment.
somewhat misleading, as to me it implied that “elementwise auto-assignment” was a way of using NUTS without initialization. Additionally, it might be worth catching Theano’s NotImplementedError and supplying it with a message telling the user that this is, or could be, caused by a missing gradient implementation for the likelihood function.