Very parallel MCMC sampling

colcarroll · August 19, 2019, 4:40pm

One reason I’m excited about PyMC4 using TensorFlow probability is the chance to run hundreds/thousands of chains. I write about this here:

https://colindcarroll.com/2019/08/18/very-parallel-mcmc-sampling

chartl · August 19, 2019, 6:26pm

It seems like most of the speedup just comes from one-shot sampling and pre-allocation of memory. It seems like this trick can be used in pymc3 just by adding a dimension and “cheating” by assigning the same observations to each dimension:

import numpy as np
import pymc3 as pm

mu_true = 3.0
sd_true = 0.4

observed = np.random.normal(size=(10,))*sd_true + mu_true

# 150 chains in 3 "chains"
CHEAT_CHAINS=50
with pm.Model() as mod:
    mu = pm.Normal('mu', 0, 1, shape=CHEAT_CHAINS)
    err = pm.HalfNormal('sd', 1, shape=CHEAT_CHAINS)
    for j in range(CHEAT_CHAINS):
        xi = pm.Normal('x_%d' % j, mu[j], err[j], observed=observed)
    tr = pm.sample(1000, chains=3, cores=3, tune=500)

# 00:44<00:00, 101.14draws/s
# 00:40<00:00, 109.96draws/s

# old-fashioned:
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2500, hard))
with pm.Model() as mod:
    mu = pm.Normal('mu', 0, 1)
    err = pm.HalfNormal('sd', 1)
    x = pm.Normal('x', mu, err, observed=observed)
    tr = pm.sample(1000, chains=CHEAT_CHAINS*3, cores=3, tune=500)

# 01:04<00:00, 3471.98draws/s
# 01:13<00:00, 3056.69draws/s

so a pretty decent speedup even using NUTS.

colcarroll · August 19, 2019, 6:46pm

That’s very creative!

You’re right about where the majority of the speedups come from. I think this will be more pronounced in HMC, where the gradient is also done in one shot (as above!). I actually wrote this code while working on unbiased MCMC with couplings, where I also get a speedup from reusing a Cholesky decomposition and a .solve.

chartl · August 19, 2019, 7:37pm

Also with regards to

Hopefully the linear algebra you used gives you performance gains, too.

I’ve noticed that the linear algebra libraries installed for most cloud instances I spin up are parallelized by default, e,g., PARPACK. I wonder i) if any of your observed speedup comes from “hidden” library-level parallelism; ii) whether pymc4 (or even pymc3) turns off fine-grained (library) parallelism when multiple chains are being run in parallel. I’ve seen cores=4 seemingly eat up all 8 cores, so maybe not?

junpenglao · August 20, 2019, 10:25am

Well in theory this is not correct because you modified the model log_prob and the dimension of the random variables. Not sure if there is a correct way to do it as we reduce sum the log_prob

chartl · August 22, 2019, 4:42pm

I think this works just fine. I’ve replaced the target likelihood \pi(x) by the product distribution \Pi(X) = \prod_{j=1}^n \pi(x_j), so HMC converges in probability to a sample over \Pi(X) - i.e., n i.i.d. samples from \pi(x).

Gon_F · August 22, 2019, 7:51pm

I sure hope the greatness hoped for in utilizing now Tensorflow Probability in Pymc instead of Theano won’t all amount to just using shortcuts that simply trade-off some aspects of model accuracy for others

Nevertheless, this analysis is interesting! Who knows where this can lead?

junpenglao · August 22, 2019, 8:50pm

Hmmm doesnt sound right to me, maybe it is valid using MH, but HMC amend the system, so now the sampler is moving in a space that is impossible if you are using a batch log_prob - I would imagine it reduces the efficiency of the sampler.

Topic		Replies	Views
Parallelizing chains with custom likelihood on multiple cores v5	29	3091	March 24, 2023
Nested Parallel in PyMC3 Questions	5	1069	December 6, 2020
Running Pymc3 fast posterior sample in a multi-node cluster getting stuck Questions	4	751	February 8, 2022
Using multiple chains in SMC in PyMC3 Questions smc_abc	9	1589	December 3, 2021
Can different samples be merged? Questions	12	2048	May 3, 2019

Very parallel MCMC sampling

Related topics