PyMC3 Pickle Issue with GPU

I have recently installed PyMC3 in a Python 2.7 environment on my GPU-equipped desktop PC. The installation was successful, and I have been able to run the following sample code:

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
import seaborn as sns
import time
#print('Running on PyMC3 v{}'.format(pm.__version__))
plt.style.use('seaborn-darkgrid')

# Initialize random number generator
np.random.seed(123)

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variable
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma

fig, axes = plt.subplots(1, 2, sharex=True, figsize=(10,4))
axes[0].scatter(X1, Y)
axes[1].scatter(X2, Y)
axes[0].set_ylabel('Y'); axes[0].set_xlabel('X1'); axes[1].set_xlabel('X2');
plt.show()

basic_model = pm.Model()
with basic_model:

    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    sigma = pm.HalfNormal('sigma', sd=1)
    #sigma = pm.HalfNormal('sigma', sd=2)

    # Expected value of outcome
    mu = alpha + beta[0]*X1 + beta[1]*X2

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

map_estimate = pm.find_MAP(model=basic_model, method='powell')

print "Map estimate = " , map_estimate

time1 = time.clock()

with basic_model:
    # draw 1000 posterior samples
    trace = pm.sample(1000)
    #mean_field = pm.fit(method='advi')

time2 = time.clock()

print 'sample time = ', time2 - time1, ' seconds'

pm.traceplot(trace);
#pm.plot_posterior(mean_field.sample(1000), color='LightSeaGreen');
plt.show()

When I run on the CPU using device = cpu in my .theanorc.txt file, I get the following:

WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
  0%|                                                                                         | 0/5000 [00:00<?, ?it/s]D:\Anaconda3\envs\pymc3_env_2_7\lib\site-packages\scipy\optimize\_minimize.py:381: RuntimeWarning: Method powell does not use gradient information (jac).
  RuntimeWarning)
logp = -149.47, ||grad|| = 13.25: 100%|############################################| 177/177 [00:00<00:00, 3765.96it/s]
Map estimate =  {'alpha': array(0.9090678691864014, dtype=float32), 'beta': array([ 0.9514268 ,  2.61449409], dtype=float32), 'sigma': array(0.9656924605369568, dtype=float32), 'sigma_log__': array(-0.03490985184907913, dtype=float32)}
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta, alpha]
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
  0%|                                                                                         | 0/1500 [00:00<?, ?it/s]WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
100%|#############################################################################| 1500/1500 [00:02<00:00, 635.06it/s]
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
sample time =  24.7703935489  seconds

However, when I set device = cuda in .theanorc.txt, I get the following:

WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Using cuDNN version 5105 on context None
Mapped name None to device cuda: GeForce GTX 950 (0000:03:00.0)
  0%|                                                                                         | 0/5000 [00:00<?, ?it/s]D:\Anaconda3\envs\pymc3_env_2_7\lib\site-packages\scipy\optimize\_minimize.py:381: RuntimeWarning: Method powell does not use gradient information (jac).
  RuntimeWarning)
logp = -148.98, ||grad|| = 0.73744: 100%|###########################################| 183/183 [00:01<00:00, 167.74it/s]
Map estimate =  {'alpha': array(0.9090930819511414, dtype=float32), 'beta': array([ 0.9514547 ,  2.61456656], dtype=float32), 'sigma': array(0.9656581282615662, dtype=float32), 'sigma_log__': array(-0.03494539484381676, dtype=float32)}
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta, alpha]
Could not pickle model, sampling singlethreaded.
Sequential sampling (4 chains in 1 job)
NUTS: [sigma, beta, alpha]
100%|#############################################################################| 1500/1500 [00:07<00:00, 191.67it/s]
100%|#############################################################################| 1500/1500 [00:07<00:00, 207.41it/s]
100%|#############################################################################| 1500/1500 [00:07<00:00, 200.51it/s]
100%|#############################################################################| 1500/1500 [00:09<00:00, 162.73it/s]
sample time =  40.0180990856  seconds

All outputs and charts appear to be correct and agree very well. My questions are:

  1. I observe that the run is significantly faster using the CPU. Is this normal behavior?
  2. What is going on with the ‘Could not pickle model, sampling singlethreaded.’ output line? Is there something I can do to make the GPU run faster?
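For reference, the only difference between the two runs is the Theano device setting. A minimal sketch of making the same switch from inside the script, assuming Theano's standard THEANO_FLAGS environment variable (it must be set before Theano or PyMC3 is imported), is:

import os

# Pick the compute device before Theano is imported (directly or via PyMC3):
# "cpu" reproduces the first run, "cuda" the second.
os.environ["THEANO_FLAGS"] = "device=cuda,floatX=float32"

import pymc3 as pm  # Theano reads the flags when it is first imported

Editing device = cpu / device = cuda in .theanorc.txt, as in the runs above, is equivalent.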

Not sure about this one; your model seems quite straightforward. Did you try upgrading to the newest release, PyMC3 3.5?

I am using PyMC3 version 3.5rc1.

I guess you saw it on GitHub as well, but in case anybody else is interested:

I think I know why. Parallel sampling is properly supported under Python 3, but it would only work on the GPU if you had 4 GPUs, one per chain. So try setting cores=1 (njobs=1 on older releases) in pm.sample.
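If it helps, here is a minimal sketch of that change, assuming PyMC3 3.5, where pm.sample takes a cores argument (older releases used njobs):

with basic_model:
    # Sample the 4 chains one after another in a single process, which
    # avoids the model-pickling step that multiprocess sampling requires.
    trace = pm.sample(1000, chains=4, cores=1)

With a single GPU this also keeps the chains from competing for the same device.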
