The current setup does not take full advantage of the GPU because of limitations in Theano: when you run on a GPU, PyMC3 incurs a lot of overhead copying data back and forth between the GPU and the CPU.
You should preferably always use multiple chains. If the sampler seems to hang when you run with chains=4 or similar, that usually means some of the chains are stuck in a region of the parameter space that is difficult to sample, which is an indication that you should improve the model. So in general you should do something like trace = pm.sample(1000, tune=1000, cores=4, chains=4)
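As a rough sketch of that recommendation, here is one way to wire it up. The helper names `sampler_settings` and `run_example`, the toy model, and the idea of capping `cores` at the machine's CPU count are my own assumptions for illustration, not something PyMC3 requires. It also assumes a PyMC3 version recent enough to accept the `sigma` keyword.

```python
import os


def sampler_settings(chains=4, draws=1000, tune=1000):
    """Build keyword arguments for pm.sample (hypothetical helper).

    Assumption: never request more cores than the machine reports.
    """
    cores = min(chains, os.cpu_count() or 1)
    return dict(draws=draws, tune=tune, chains=chains, cores=cores)


def run_example():
    """Sample a toy model with several chains and return (trace, summary)."""
    # Imported lazily so sampler_settings() can be used without PyMC3 installed.
    import pymc3 as pm

    with pm.Model():
        # Toy model, for illustration only.
        mu = pm.Normal("mu", mu=0.0, sigma=1.0)
        pm.Normal("obs", mu=mu, sigma=1.0, observed=[0.1, -0.3, 0.2])
        trace = pm.sample(**sampler_settings())
        # With more than one chain the summary includes an r_hat column,
        # which helps spot chains that disagree with each other.
        summary = pm.summary(trace)
    return trace, summary
```

Calling `run_example()` should run four chains on this toy model. On a real model, a chain that runs much slower than the others, or r_hat values noticeably above 1, would point to the kind of hard-to-sample region mentioned above.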