GPU much slower than CPU

I just tested some sample code from

http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/

I tried running identical code on both the GPU and the CPU. This is the first time I have observed the CPU running much faster than the GPU, by roughly a factor of 10. Is there some obvious reason why that is?

Running on the GPU has a large overhead because the sampler runs in Python (on the CPU), while the gradient and logp functions run on the GPU (via Theano). So unless your model involves very large matrix operations with expensive gradient computations that can actually exploit the GPU, using it is usually not worth it.
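A back-of-envelope cost model might make this concrete. The numbers below are purely illustrative assumptions (not measurements): the GPU pays a much larger fixed per-call latency (kernel launch plus host/device transfer) but has higher arithmetic throughput, so it only wins once the per-call work is large enough to amortize that latency.

```python
# Illustrative cost model, not measured values: one gradient/logp call
# pays a fixed overhead plus a per-element arithmetic cost.

def step_time(n_elements, per_element_cost, fixed_overhead):
    """Time (seconds) for one gradient evaluation under this toy model."""
    return fixed_overhead + n_elements * per_element_cost

# Hypothetical costs: the GPU here has ~100x the fixed overhead of the
# CPU, but ~10x the arithmetic throughput.
cpu = lambda n: step_time(n, per_element_cost=1e-8, fixed_overhead=1e-5)
gpu = lambda n: step_time(n, per_element_cost=1e-9, fixed_overhead=1e-3)

small, large = 1_000, 10_000_000
print(gpu(small) / cpu(small))   # >> 1: GPU much slower on a small model
print(gpu(large) / cpu(large))   # < 1: GPU wins on large matrix ops
```

With these made-up constants the crossover sits somewhere in the millions of elements, which matches the intuition that only models dominated by big matrix operations benefit.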

That’s plausible. Is this overhead consistent with the slowdown persisting after the NUTS progress counter started? I mean, it’s not just the beginning of the program execution that suffers from the CPU-side sampler overhead.

It slows down the whole run, not just the compilation, because at each step the data is copied to the GPU for the gradient evaluation and copied back to the CPU for the leapfrog update.
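A rough sketch of why the cost recurs on every step (this is a schematic, not the actual PyMC3/Theano internals): each leapfrog step needs one gradient evaluation on the device and one position/momentum update back on the host, so every step pays two transfers.

```python
# Schematic count of host<->device transfers in a leapfrog trajectory.
# Names and structure are illustrative, not PyMC3/Theano internals.

def count_transfers(n_leapfrog_steps):
    transfers = 0
    for _ in range(n_leapfrog_steps):
        transfers += 1  # copy position to GPU, evaluate grad(logp) there
        transfers += 1  # copy gradient back to CPU for the leapfrog update
    return transfers

# A NUTS trajectory at tree depth 10 takes up to 2**10 leapfrog steps,
# so a single sample can incur on the order of 2 * 1024 transfers.
print(count_transfers(2**10))  # -> 2048
```

So the overhead is not a one-time startup cost; it is paid on every leapfrog step for the entire duration of sampling.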