One reason I’m excited about PyMC4 using TensorFlow Probability is the chance to run hundreds or thousands of chains. I write about this here:

# Very parallel MCMC sampling

It seems like most of the speedup comes simply from one-shot sampling and pre-allocation of memory. This trick can be used in PyMC3 just by adding a dimension and “cheating” by assigning the same observations to each dimension:

```
import numpy as np
import pymc3 as pm

mu_true = 3.0
sd_true = 0.4
observed = np.random.normal(size=(10,)) * sd_true + mu_true

# 150 effective chains, run as 3 "real" chains x 50 cheat dimensions
CHEAT_CHAINS = 50
with pm.Model() as mod:
    mu = pm.Normal('mu', 0, 1, shape=CHEAT_CHAINS)
    err = pm.HalfNormal('sd', 1, shape=CHEAT_CHAINS)
    for j in range(CHEAT_CHAINS):
        xi = pm.Normal('x_%d' % j, mu[j], err[j], observed=observed)
    tr = pm.sample(1000, chains=3, cores=3, tune=500)
# 00:44<00:00, 101.14draws/s
# 00:40<00:00, 109.96draws/s

# old-fashioned: one pymc3 chain per chain
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2500, hard))

with pm.Model() as mod:
    mu = pm.Normal('mu', 0, 1)
    err = pm.HalfNormal('sd', 1)
    x = pm.Normal('x', mu, err, observed=observed)
    tr = pm.sample(1000, chains=CHEAT_CHAINS * 3, cores=3, tune=500)
# 01:04<00:00, 3471.98draws/s
# 01:13<00:00, 3056.69draws/s
```

So: a pretty decent speedup, even using NUTS.
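The per-`j` loop in the model above can in principle be vectorized away entirely. Here is a minimal NumPy-only sketch of the broadcasting involved; the `mu`/`sd` arrays are stand-ins for the free random variables, not actual pymc3 code:

```python
import numpy as np

# The same 10 observations are paired with 50 independent (mu, sd)
# settings, so one vectorized log-likelihood covers all "cheat chains".
observed = np.random.default_rng(2).normal(3.0, 0.4, size=10)
CHEAT_CHAINS = 50
mu = np.zeros(CHEAT_CHAINS)   # stand-in for the mu random variable
sd = np.ones(CHEAT_CHAINS)    # stand-in for the sd random variable

# broadcast (10, 1) data against (50,) parameters -> (10, 50)
z = (observed[:, None] - mu) / sd
loglik = -0.5 * (z ** 2 + np.log(2 * np.pi)) - np.log(sd)
assert loglik.shape == (10, CHEAT_CHAINS)
```

Summing `loglik` over the data axis gives one log-likelihood term per cheat chain, all in a single array operation.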

That’s very creative!

You’re right about where the majority of the speedup comes from. I think this will be even more pronounced in HMC, where the gradient is also computed in one shot (as above!). I actually wrote this code while working on unbiased MCMC with couplings, where I also get a speedup from reusing a Cholesky decomposition and a `.solve`.
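The Cholesky-reuse trick mentioned here can be sketched with SciPy. A generic SPD matrix stands in for whatever covariance the couplings code actually uses:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
cov = A @ A.T + 4 * np.eye(4)   # some SPD matrix, e.g. a mass matrix
B = rng.normal(size=(4, 100))   # 100 right-hand sides (one per chain)

c, low = cho_factor(cov)        # O(n^3) factorization, done once
X = cho_solve((c, low), B)      # O(n^2) per solve, reusing the factor

assert np.allclose(cov @ X, B)
```

Factoring once and solving many times is where the batched speedup comes from, versus calling a generic `solve` per chain.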

Also, with regards to

> Hopefully the linear algebra you used gives you performance gains, too.

I’ve noticed that the linear algebra libraries installed on most cloud instances I spin up are parallelized by default, e.g., PARPACK. I wonder (i) if any of your observed speedup comes from “hidden” library-level parallelism, and (ii) whether PyMC4 (or even PyMC3) turns off fine-grained (library-level) parallelism when multiple chains are being run in parallel. I’ve seen `cores=4` seemingly eat up all 8 cores, so maybe not?
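On (ii), one way to rule out hidden BLAS/OpenMP parallelism is to pin the library-level thread counts before numpy/theano are imported. A hedged sketch; these are the commonly used environment variables, and which one matters depends on the BLAS backend actually installed:

```python
import os

# Pin library-level (BLAS / OpenMP) parallelism to one thread so that
# chain-level parallelism (cores=...) doesn't oversubscribe the CPU.
# Must run before numpy/theano are first imported to take effect.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"
```

If the speedup survives with these set, it isn’t coming from hidden library-level threads.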

Well, in theory this is not correct, because you modified the model’s log_prob and the dimension of the random variables. I’m not sure there is a correct way to do it, since we reduce-sum the log_prob.

I think this works just fine. I’ve replaced the target likelihood $\pi(x)$ with the product distribution $\Pi(X) = \prod_{j=1}^n \pi(x_j)$, so HMC converges to samples from $\Pi(X)$, i.e., $n$ i.i.d. samples from $\pi(x)$.
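A quick numerical sanity check of the factorization claim, with a standard normal as a stand-in target:

```python
import numpy as np
from scipy.stats import norm

# The joint density of the batched state factorizes,
# Pi(X) = prod_j pi(x_j), so the batched log_prob is just the
# reduce-sum of independent per-chain terms.
rng = np.random.default_rng(0)
X = rng.normal(size=50)                # one position per cheat chain
log_Pi = np.log(np.prod(norm.pdf(X)))  # log of the product density
sum_log_pi = norm.logpdf(X).sum()      # reduce-summed per-chain terms
assert np.isclose(log_Pi, sum_log_pi)
```

Because the joint factorizes, each coordinate’s marginal under $\Pi(X)$ is exactly $\pi(x)$, independent of the others.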

I sure hope the greatness hoped for in using TensorFlow Probability in PyMC instead of Theano won’t all amount to shortcuts that simply trade off some aspects of model accuracy for others.

Nevertheless, this analysis is interesting! Who knows where this can lead?

Hmm, that doesn’t sound right to me. Maybe it is valid using MH, but HMC augments the system, so the sampler is now moving in a space that would be impossible if you were using a batched log_prob. I would imagine it reduces the efficiency of the sampler.