Maybe related to this thread.

Sampling with NUTS slows down considerably as more chains are sampled when the model contains a `tensor.dot` operation. Here is an example:

Data generation for a simple linear model:

```python
import theano
import theano.tensor as T
import numpy as np
import pymc3 as pm3
# test data
N = 1000
x1 = 10 + 2 * np.random.randn(N,1)
x2 = 5 + 2 * np.random.randn(N,1)
X = np.c_[np.ones_like(x1),x1,x2]
Y = X.dot([10,1,-1]) + 2*np.random.randn(N)
print(np.linalg.inv(X.T.dot(X)).dot(X.T.dot(Y)))
```
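As a sanity check, the normal-equations estimate above agrees with `np.linalg.lstsq` on the same data (pure NumPy, independent of theano/pymc3; the seed is added here for reproducibility and is not in the original):

```python
import numpy as np

np.random.seed(0)  # added for reproducibility; the original snippet does not seed
N = 1000
x1 = 10 + 2 * np.random.randn(N, 1)
x2 = 5 + 2 * np.random.randn(N, 1)
X = np.c_[np.ones_like(x1), x1, x2]
Y = X.dot([10, 1, -1]) + 2 * np.random.randn(N)

# Normal equations, as in the snippet above
beta_ne = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(Y))
# Least-squares solver; should agree to numerical precision
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_ne, beta_ls)
```

Both should land close to the true coefficients `[10, 1, -1]`.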

and creating two identical pymc3 models:

```python
with pm3.Model() as model1:
    beta = pm3.Flat('beta', shape=X.shape[1])
    sigma = pm3.Exponential('sigma', lam=.1)
    xb = T.dot(X, beta)
    like = pm3.Normal('like', mu=xb, sd=sigma, observed=Y)
```

and

```python
with pm3.Model() as model2:
    beta = pm3.Flat('beta', shape=X.shape[1])
    sigma = pm3.Exponential('sigma', lam=.1)
    xb = beta[0] + beta[1]*X[:,1] + beta[2]*X[:,2]
    like = pm3.Normal('like', mu=xb, sd=sigma, observed=Y)
```
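The two mean specifications are mathematically identical; only the theano graph differs (a single `dot` node vs. elementwise multiplies and adds). In plain NumPy terms:

```python
import numpy as np

np.random.seed(0)  # seed added for reproducibility
N = 1000
x1 = 10 + 2 * np.random.randn(N, 1)
x2 = 5 + 2 * np.random.randn(N, 1)
X = np.c_[np.ones_like(x1), x1, x2]

beta = np.array([10.0, 1.0, -1.0])
mu_dot = X.dot(beta)                                   # model1's T.dot(X, beta)
mu_sum = beta[0] + beta[1]*X[:, 1] + beta[2]*X[:, 2]   # model2's elementwise form
print(np.allclose(mu_dot, mu_sum))                     # the two means coincide
```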

and then sampling with `chains=1`, 2, and 3 for 100 samples and `tune=50`. The times are given below:

Chains | Model 1 | Model 2
---|---|---
1 | 813 ms | 828 ms
2 | 1.23 ms | 1.1 s
3 | 1 min 41 s | 994 ms

I should add that:

- There are plenty of free CPUs and memory.
- The same timings occur if `X` is a theano shared variable.
- This doesn't happen with Metropolis, but I didn't try any other samplers.
- This doesn't improve with more samples/tuning.