NUTS sampling performance when using tensor.dot operator

My guess is that tensor.dot is already parallelized by BLAS (or whatever linear algebra backend your machine uses), so when you run multiple chains, the chains and the tensor.dot calls compete for the same cores. It shows up with NUTS because the gradient computation is the only operation expensive enough to be run in parallel.

Unfortunately I have no good solution to this. It is more or less inherent in how pymc3 is set up: we run parallel chains on top of a library (theano) that is already optimized locally.
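
One thing that might be worth trying, mostly as a way to check the guess about resource contention: cap the number of BLAS threads per process so that each chain stays on its own core. This is only a minimal sketch. It assumes your BLAS backend honors the usual thread-count environment variables (which of them applies depends on how your numpy/theano was built, and I have not verified it for your setup), and `limit_blas_threads` is just a hypothetical helper name, not part of pymc3 or theano.

```python
import os

# Environment variables commonly read by OpenMP / MKL / OpenBLAS at load time.
# Assumption: at least one of these applies to your BLAS build.
BLAS_THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_blas_threads(n_threads=1, environ=os.environ):
    """Set the thread-count variables and return the values that were set."""
    for var in BLAS_THREAD_VARS:
        environ[var] = str(n_threads)
    return {var: environ[var] for var in BLAS_THREAD_VARS}


# Call this before numpy / theano / pymc3 are imported; the variables are
# typically read once, when the BLAS library is loaded.
limit_blas_threads(1)

# Only after this point:
# import pymc3 as pm
# with model:  # `model` stands for your own model
#     trace = pm.sample(chains=4, cores=4)
```

If multi-chain sampling gets faster with the cap in place, that would support the contention explanation. The trade-off is that a single chain may get slower, since tensor.dot can no longer spread across several threads.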