V4 gives slower sampling, using example from announcement

After migrating to V4, I noticed that sampling is slower for me on my laptop. I decided to run the radon example described in the V4 announcement (PyMC 4.0 Release Announcement — PyMC project website). I tried a few seeds, and V4 is consistently slower for me. Any ideas on why this might be happening?

I’ll share some output (see attachments for Python script):

$ python radon_v3.py
PyMC3 version: 3.11.2
Theano version: 1.1.2
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [eps, b, sigma_b, mu_b, a, sigma_a, mu_a]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 9 seconds.
The number of effective samples is smaller than 25% for some parameters.
Wall time for pm3.sample: 14.616650342941284

$ python radon_v4.py
PyMC version: 4.1.3
Aesara version: 2.7.7
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_a, sigma_a, a, mu_b, sigma_b, b, eps]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 17 seconds.
Wall time for pm.sample: 22.14641046524048

Edit: I am not using the JAX backend because my actual problem requires sampling latent discrete variables.

radon_v3.py (1.5 KB)
radon_v4.py (1.5 KB)

Two quick things. First, wall time is obviously important, but on its own it doesn’t mean much when comparing reasonably similar PPLs/models/sampling algorithms. I am seeing considerably higher ESS (e.g., 2x higher overall) and lower $\hat{R}$ from v4 than from v3. Second, with v3 I am seeing lower-than-target acceptance rates and small numbers of divergences, both of which can make sampling artificially “faster” at the expense of useful information (consistent with the lower ESS).
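For anyone following along with the $\hat{R}$ comparison: here is a minimal NumPy sketch of the classic split-$\hat{R}$ diagnostic (not the rank-normalized variant ArviZ uses by default; function and variable names are my own), just to illustrate why stuck or poorly mixed chains push it above 1:

```python
import numpy as np

def split_rhat(draws):
    """Classic split-R-hat for draws shaped (chains, iterations)."""
    chains, n = draws.shape
    half = n // 2
    # Split each chain in half so within-chain trends also inflate R-hat.
    halves = draws[:, : 2 * half].reshape(chains * 2, half)
    m, n = halves.shape
    chain_means = halves.mean(axis=1)
    b = n * chain_means.var(ddof=1)            # between-chain variance
    w = halves.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * w + b / n         # pooled variance estimate
    return float(np.sqrt(var_plus / w))

rng = np.random.default_rng(0)
# Well-mixed "chains" (iid draws): R-hat should be very close to 1.
good = rng.normal(size=(4, 1000))
print(round(split_rhat(good), 3))
# Chains stuck at different locations: R-hat well above 1.
bad = good + np.arange(4)[:, None]
print(round(split_rhat(bad), 3))
```

Lower $\hat{R}$ alongside higher ESS is what you'd expect from better-mixing chains, which is why the two diagnostics should be read together rather than wall time alone.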


I was wondering whether I had misconfigured something, e.g. whether PyMC 4.0 is not using some feature that PyMC3 uses. Otherwise, I was looking for an explanation of what has changed in PyMC 4.0 that might explain the slowdown.

I’m checking the claim of higher ESS in v4 with the same example. Here are some results averaged over 10 runs (4 chains per run, 1000 + 1000 draws per chain):

PyMC3
Average wall time (s): 12.418181300163269
Average sampling time (s): 7.8218183517456055
Average min ESS: 633.3899494222175

PyMC 4.0
Average wall time (s): 19.462969613075256
Average sampling time (s): 14.757731986045837
Average min ESS: 618.1402097314276

(The minimum is taken over all parameters; these runs were done on an HPC, so times are faster than in the original post.)
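One way to put time and ESS on a single axis is effective samples per second of sampling. Plugging in the averages above (plain arithmetic, no PyMC needed):

```python
# Averaged results reported above: sampling time (s) and minimum ESS.
v3_time, v3_ess = 7.8218183517456055, 633.3899494222175
v4_time, v4_ess = 14.757731986045837, 618.1402097314276

# Effective samples generated per second of sampling.
print(f"v3: {v3_ess / v3_time:.1f} ESS/s")
print(f"v4: {v4_ess / v4_time:.1f} ESS/s")
```

On these numbers v3 produces roughly twice the effective samples per second, since the ESS is essentially unchanged while the sampling time nearly doubles.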

There is a significant difference in time, but not in ESS. Why would we expect v4 to have better efficiency?

Perhaps this is not a good example for comparison? Please let me know if there are better examples. For my own problem, I have noticed that v4 is slower and I don’t know why (which makes me doubt whether the migration is worth it).


Not sure how informative the minimum ESS is going to be; at that point you are likely looking at a single parameter. A parameter-wise comparison would probably be more informative. But it is also possible that something in your configuration explains the consistent timing differences you are seeing; there are plenty of theano/aesara configuration options, for example.
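As a starting point for hunting configuration differences, something like this prints a few key settings in each environment (a sketch only; the attribute names follow the Theano and Aesara config systems as I remember them, so adjust if your versions differ):

```shell
# PyMC3 environment: inspect Theano's active configuration
python -c "import theano; print(theano.config.mode, theano.config.floatX, theano.config.blas.ldflags)"

# PyMC 4 environment: inspect Aesara's active configuration
python -c "import aesara; print(aesara.config.mode, aesara.config.floatX, aesara.config.blas__ldflags)"
```

A missing BLAS link (empty ldflags) or a different compilation mode in one environment but not the other is a common source of otherwise-unexplained slowdowns.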
