Maybe @JWarmenhoven could provide some comments here, but in general there should be no difference between using a Bernoulli and a Binomial likelihood if you don't have additional trial-level information.
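To illustrate why the two likelihoods should agree, here is a minimal sketch in plain Python (no PyMC3 needed). The data and the helper names `bernoulli_loglik` and `binomial_loglik` are made up for illustration. It shows that, for the same success probability, the two log-likelihoods differ only by the log binomial coefficient, which does not depend on `p` and so should not change the posterior.

```python
import math

def bernoulli_loglik(p, outcomes):
    """Log-likelihood of individual 0/1 trials with success probability p."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in outcomes)

def binomial_loglik(p, k, n):
    """Log-likelihood of k successes out of n trials with success probability p."""
    return (math.log(math.comb(n, k))
            + k * math.log(p)
            + (n - k) * math.log(1.0 - p))

# Hypothetical trial-level data: 1 = success, 0 = failure.
outcomes = [1, 0, 1, 1, 0, 1, 0, 1]
n = len(outcomes)
k = sum(outcomes)

# The gap between the two log-likelihoods, evaluated at several values of p.
gaps = [binomial_loglik(p, k, n) - bernoulli_loglik(p, outcomes)
        for p in (0.2, 0.5, 0.8)]

# The constant that the two likelihoods differ by.
log_binom_coeff = math.log(math.comb(n, k))

for p, gap in zip((0.2, 0.5, 0.8), gaps):
    print(f"p={p}: gap={gap:.6f}, log C(n, k)={log_binom_coeff:.6f}")
```

The gap is the same for every `p`, so both likelihoods have the same shape as a function of `p`. This only holds when all trials share one `p`; with trial-level covariates each trial has its own probability and the Bernoulli form is needed.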
Also, the slowness of Tensor.dot is likely GPU related: in PyMC3 you usually get better performance by running on the CPU only.
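If you want to check whether the GPU is the cause, one way is to force the Theano backend onto the CPU. This is only a sketch, assuming your PyMC3 install uses Theano and that nothing has imported Theano yet in the session. Note that it overwrites any `THEANO_FLAGS` you have already set.

```python
import os

# Must run before theano / pymc3 are imported, because Theano reads
# THEANO_FLAGS once at import time.
os.environ["THEANO_FLAGS"] = "device=cpu"

# After this point you would import pymc3 and build the model as usual.
```

The same setting can also be passed on the command line, e.g. `THEANO_FLAGS=device=cpu python your_script.py`, where `your_script.py` stands in for your own file.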