NUTS sampling performance when using tensor.dot operator

Rob_Hicks · April 30, 2019, 8:11pm

Maybe related to this thread.

Sampling when using NUTS and tensor.dot operators slows down considerably as more chains are sampled. Here is an example:

Data generation for simple linear models

import theano as theano 
import theano.tensor as T
import numpy as np
import pymc3 as pm3

# test data

N = 1000

x1 = 10 + 2 * np.random.randn(N,1)
x2 = 5 + 2 * np.random.randn(N,1)
X = np.c_[np.ones_like(x1),x1,x2]
Y = X.dot([10,1,-1]) + 2*np.random.randn(N)

print(np.linalg.inv(X.T.dot(X)).dot(X.T.dot(Y)))

and creating two identical pymc3 models:

with pm3.Model() as model1:
    beta = pm3.Flat('beta',shape=X.shape[1])
    sigma = pm3.Exponential('sigma',lam=.1)
    
    xb = T.dot(X, beta)
    
    like = pm3.Normal('like', mu = xb, sd = sigma, observed=Y)

and

with pm3.Model() as model2:
    beta = pm3.Flat('beta',shape=X.shape[1])
    sigma = pm3.Exponential('sigma',lam=.1)
    xb = beta[0] + beta[1]*X[:,1] + beta[2]*X[:,2]
    like = pm3.Normal('like', mu = xb, sd = sigma, observed=Y)

and then sampling with chains=1, 2, and 3 for 100 samples and tune=50. The times are given below:

Chains	Model 1	Model 2
1	813ms	828ms
2	1.23ms	1.1 s
3	1min 41s	994 ms

I should add that

there are lots of free cpu’s and memory.
the same timings occur if X is a tensor shared variable
this doesn’t happen with Metropolis, but didn’t try any other samplers
this doesn’t improve with more samples/tuning

junpenglao · May 25, 2019, 6:26am

My guess is that tensor.dot is already paralleled by BLAS (or the Linear Algebra engine of your computer), and when there is multiple chain and tensor.dot they are competing with resource. And it happens in NUTS because only the gradient operation is expensive/needed to run in parallel.

Unfortunately I have no good solution to this… it is almost the nature of the way pymc3 is set up that we have parallel chain on top of an already locally optimized library (theano).

Rob_Hicks · May 26, 2019, 7:49am

While this is a trivial example, I have a few other models where avoiding linear algebra operations are difficult. In case anyone else runs into these issues, I’ve had success with gradiant free samplers like DEMetropolis (or Metropolis) step methods.

Topic		Replies	Views
Super slow sampling Questions	0	696	August 8, 2018
Optimization suggestion for Hierarchical Model using NUTS on CPU/GPU Questions gpu , theano	7	1349	June 4, 2018
Avoiding non-NUTS samplers when broadcasting a multiplication operation v5	10	171	April 13, 2024
Sampling became much slower after upgrading pymc3 Questions gpu	3	2127	April 12, 2020
PyMC3 slows rapidly with increasing numbers of parameters Questions	5	11075	December 3, 2018

NUTS sampling performance when using tensor.dot operator

Related topics