Very parallel MCMC sampling

One reason I’m excited about PyMC4 using TensorFlow Probability is the chance to run hundreds or thousands of chains. I write about this here:


Most of the speedup seems to come from one-shot sampling and pre-allocation of memory. The same trick can apparently be used in pymc3 just by adding a dimension and “cheating” by assigning the same observations to each copy:

import numpy as np
import pymc3 as pm

mu_true = 3.0
sd_true = 0.4

observed = np.random.normal(size=(10,)) * sd_true + mu_true

CHEAT_CHAINS = 50  # independent model copies packed into each actual chain

# 150 effective chains run as 3 "chains"
with pm.Model() as mod:
    mu = pm.Normal('mu', 0, 1, shape=CHEAT_CHAINS)
    err = pm.HalfNormal('sd', 1, shape=CHEAT_CHAINS)
    # every copy sees the same observations
    for j in range(CHEAT_CHAINS):
        xi = pm.Normal('x_%d' % j, mu[j], err[j], observed=observed)
    tr = pm.sample(1000, chains=3, cores=3, tune=500)

# 00:44<00:00, 101.14draws/s
# 00:40<00:00, 109.96draws/s

# old-fashioned: one real chain per effective chain
import resource

# 150 separate chains can exhaust the default open-file limit,
# so raise it before sampling
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2500, hard))

with pm.Model() as mod:
    mu = pm.Normal('mu', 0, 1)
    err = pm.HalfNormal('sd', 1)
    x = pm.Normal('x', mu, err, observed=observed)
    tr = pm.sample(1000, chains=CHEAT_CHAINS * 3, cores=3, tune=500)

# 01:04<00:00, 3471.98draws/s
# 01:13<00:00, 3056.69draws/s

So a pretty decent wall-clock speedup (roughly 40–44 s versus 64–73 s; note the draws/s figures count total draws across all chains), even using NUTS.
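As an aside, the per-copy loop above can probably be vectorized away by broadcasting the same observations across all copies. A sketch (untested; the broadcast shapes are my assumption about how pymc3 resolves them):

with pm.Model() as mod_vec:
    mu = pm.Normal('mu', 0, 1, shape=CHEAT_CHAINS)
    err = pm.HalfNormal('sd', 1, shape=CHEAT_CHAINS)
    # row j holds the same data, modeled with mu[j], err[j]
    x = pm.Normal('x', mu[:, None], err[:, None],
                  observed=np.tile(observed, (CHEAT_CHAINS, 1)))
    tr_vec = pm.sample(1000, chains=3, cores=3, tune=500)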


That’s very creative!

You’re right about where the majority of the speedup comes from. I think this will be even more pronounced in HMC, where the gradient is also computed in one shot (as above!). I actually wrote this code while working on unbiased MCMC with couplings, where I also get a speedup from reusing a Cholesky decomposition and a .solve.
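Schematically, the reuse I mean looks like this (a toy numpy sketch, not the actual coupling code; the covariance and batch size are made up):

import numpy as np

cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])            # fixed proposal covariance
L = np.linalg.cholesky(cov)             # factor once ...

z = np.random.normal(size=(1000, 2))    # one-shot batch of standard normals
proposals = z @ L.T                     # ... reuse L for every draw

# reuse the same factor for the density evaluations via a solve
quad = np.sum(np.linalg.solve(L, proposals.T) ** 2, axis=0)
log_dens = -0.5 * quad - np.log(np.diag(L)).sum() - np.log(2 * np.pi)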


Also, with regard to

Hopefully the linear algebra you used gives you performance gains, too.

I’ve noticed that the linear algebra libraries installed on most cloud instances I spin up are parallelized by default, e.g., PARPACK. I wonder (i) whether any of your observed speedup comes from “hidden” library-level parallelism, and (ii) whether pymc4 (or even pymc3) turns off fine-grained (library-level) parallelism when multiple chains are run in parallel. I’ve seen cores=4 seemingly eat up all 8 cores, so maybe not?
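One cheap way to check is to pin the library thread pools to a single thread before numpy/theano are imported and see whether the speedup survives. A minimal sketch (the three variables below cover the common BLAS/OpenMP backends; whether a given cloud image respects them is worth verifying):

import os
# must run before numpy / theano / pymc3 are imported
for var in ('OMP_NUM_THREADS', 'MKL_NUM_THREADS', 'OPENBLAS_NUM_THREADS'):
    os.environ[var] = '1'

import numpy as np
import pymc3 as pm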


Well, in theory this is not correct, because you modified the model log_prob and the dimensions of the random variables. I’m not sure there is a correct way to do it, since we reduce-sum the log_prob.


I think this works just fine. I’ve replaced the target density \pi(x) by the product density \Pi(X) = \prod_{j=1}^{n} \pi(x_j), so HMC converges to a sample from \Pi(X), i.e., n i.i.d. samples from \pi(x).
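Spelling out why the copies don’t interfere: since they are independent, the batched log density and its gradient separate block-wise,

\log \Pi(X) = \sum_{j=1}^{n} \log \pi(x_j), \qquad \nabla_{x_j} \log \Pi(X) = \nabla \log \pi(x_j),

so the leapfrog update for copy j only ever sees copy j’s own gradient.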

I sure hope the gains hoped for from using TensorFlow Probability in PyMC instead of Theano won’t all amount to shortcuts that simply trade off some aspects of model accuracy for others :frowning:

Nevertheless, this analysis is interesting! Who knows where this can lead?


Hmmm, that doesn’t sound right to me. Maybe it is valid using MH, but HMC augments the system, so with a batched log_prob the sampler is moving in a joint space that is not equivalent to running independent chains. I would imagine it reduces the efficiency of the sampler.
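To make that concern concrete (my reading, assuming an identity mass matrix for simplicity): the Hamiltonian of the product target,

H(X, P) = -\sum_{j=1}^{n} \log \pi(x_j) + \frac{1}{2} \sum_{j=1}^{n} \|p_j\|^2,

does separate into per-copy pieces, so the leapfrog trajectories themselves decouple. What doesn’t decouple is everything wrapped around them: a single accept/reject decision, a single adapted step size and mass matrix, and (in NUTS) a single termination criterion for the whole batch, so one badly behaved copy can reject or shorten the move for all n copies at once.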