Hi!

I find that starting at 96 points, multiprocessing GP sampling suddenly becomes much slower than both single-process sampling and multiprocessing with 95 points. This is easily reproducible:

```
import numpy as np
import pymc3 as pm
n = 96
X = np.random.rand(n, 1)
observed = np.random.rand(n)
with pm.Model() as model:
    cov = pm.gp.cov.ExpQuad(1, 1)
    gp = pm.gp.Latent(mean_func=pm.gp.mean.Zero(), cov_func=cov)
    gp_f = gp.prior('gp_f', X=X)
    val_obs = pm.Normal('val_obs', mu=gp_f, sd=0.1, observed=observed)
    trace = pm.sample(njobs=2)
```

and produces output like

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [gp_f_rotated_]
Sampling 2 chains: 49%|████▉ | 981/2000 [01:14<01:32, 11.01draws/s]
```

The same code with 95 points instead of 96 finishes in 10 seconds. And with `njobs=1` instead of `2`, 96 points finishes in 18 seconds.

I noticed that with 96 or more points, each sampler process is itself multithreaded and uses 600% of a 6-core (12 with hyper-threading) processor. With 95 or fewer points, each process uses just 100% CPU.
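If the jump at 96 points is indeed the underlying BLAS library switching to multithreaded mode at some matrix-size threshold (my guess, not confirmed), one possible workaround is to cap BLAS at one thread per process, so the sampler's worker processes don't oversubscribe the cores. A minimal sketch, assuming the environment variables of the common BLAS backends and that they take effect when set before numpy is imported:

```python
# Sketch of a possible workaround (unverified assumption: the slowdown is
# caused by BLAS spawning its own threads inside each sampler process).
import os

# These variables cover the common backends (OpenMP, OpenBLAS, MKL).
# They must be set BEFORE numpy loads the BLAS library.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # BLAS is now limited to a single thread per process
```

With this in place each process should stay near 100% CPU regardless of `n`, though I haven't tested whether it restores the fast sampling.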

This is certainly related to an older post of mine, "Sampling doesn't start when njobs > 1 for some models". Back then, sampling didn't even start in this case; now it starts, but runs extremely slowly.