I find that starting from 96 points, multiprocess GP sampling suddenly becomes much slower than both single-process sampling and multiprocess sampling with 95 points. This is easily reproducible:
```python
import numpy as np
import pymc3 as pm

n = 96
X = np.random.rand(n, 1)
observed = np.random.rand(n)

with pm.Model() as model:
    cov = pm.gp.cov.ExpQuad(1, 1)
    gp = pm.gp.Latent(mean_func=pm.gp.mean.Zero(), cov_func=cov)
    gp_f = gp.prior('gp_f', X=X)
    val_obs = pm.Normal('val_obs', mu=gp_f, sd=0.1, observed=observed)
    trace = pm.sample(njobs=2)
```
```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [gp_f_rotated_]
Sampling 2 chains:  49%|████▉     | 981/2000 [01:14<01:32, 11.01draws/s]
```
The same code with 95 points instead of 96 finishes in 10 seconds, and with `njobs=1` instead of `2` it finishes in 18 seconds even for 96 points.
I notice that with 96 or more points each process becomes multithreaded, using about 600% CPU on my 6-core (12 threads with HT) processor. With 95 or fewer points each process uses just 100% CPU.
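If the jump in CPU usage means each chain's BLAS library starts spawning its own thread pool at this matrix size (so two processes oversubscribe the CPU), a possible workaround is to cap the thread pools before numpy/theano are imported. The environment variables below are the standard ones for OpenMP/MKL/OpenBLAS; whether this actually removes the 96-point threshold is my assumption, not something I have confirmed:

```python
import os

# Cap the BLAS/OpenMP thread pools BEFORE importing numpy/theano/pymc3.
# Assumption: the slowdown comes from thread oversubscription once the
# GP covariance matrix is large enough for the BLAS to switch to its
# multithreaded kernels.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

# With the caps in place, a matrix product of the size in question
# should run on a single thread per sampling process.
A = np.random.rand(96, 96)
B = A @ A.T
print(B.shape)
```

These variables must be set before the first `import numpy`, since the BLAS reads them once at load time.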
This is almost certainly related to an older post of mine, Sampling doesn't start when njobs > 1 for some models. Back then sampling didn't even start in this case; now it starts but runs extremely slowly.