NUTS sampling performance when using tensor.dot operator

Maybe related to this thread.

Sampling when using NUTS and tensor.dot operators slows down considerably as more chains are sampled. Here is an example:

Data generation for simple linear models

import theano as theano 
import theano.tensor as T
import numpy as np
import pymc3 as pm3

# test data

N = 1000

x1 = 10 + 2 * np.random.randn(N,1)
x2 = 5 + 2 * np.random.randn(N,1)
X = np.c_[np.ones_like(x1),x1,x2]
Y = X.dot([10,1,-1]) + 2*np.random.randn(N)

print(np.linalg.inv(X.T.dot(X)).dot(X.T.dot(Y)))

and creating two identical pymc3 models:

with pm3.Model() as model1:
    beta = pm3.Flat('beta',shape=X.shape[1])
    sigma = pm3.Exponential('sigma',lam=.1)
    
    xb = T.dot(X, beta)
    
    like = pm3.Normal('like', mu = xb, sd = sigma, observed=Y)

and

with pm3.Model() as model2:
    beta = pm3.Flat('beta',shape=X.shape[1])
    sigma = pm3.Exponential('sigma',lam=.1)
    xb = beta[0] + beta[1]*X[:,1] + beta[2]*X[:,2]
    like = pm3.Normal('like', mu = xb, sd = sigma, observed=Y)

and then sampling with chains=1, 2, and 3 for 100 samples and tune=50. The times are given below:

Chains Model 1 Model 2
1 813ms 828ms
2 1.23ms 1.1 s
3 1min 41s 994 ms

I should add that

  1. there are lots of free cpu’s and memory.
  2. the same timings occur if X is a tensor shared variable
  3. this doesn’t happen with Metropolis, but didn’t try any other samplers
  4. this doesn’t improve with more samples/tuning
1 Like

My guess is that tensor.dot is already paralleled by BLAS (or the Linear Algebra engine of your computer), and when there is multiple chain and tensor.dot they are competing with resource. And it happens in NUTS because only the gradient operation is expensive/needed to run in parallel.

Unfortunately I have no good solution to this… it is almost the nature of the way pymc3 is set up that we have parallel chain on top of an already locally optimized library (theano).

While this is a trivial example, I have a few other models where avoiding linear algebra operations are difficult. In case anyone else runs into these issues, I’ve had success with gradiant free samplers like DEMetropolis (or Metropolis) step methods.