Leveraging the GPU in PyMC3

Hi everyone,

I’m new to PyMC3 and have been working to build a docker image that allows me to run Jupyter notebooks in the cloud on p2 AWS instances so that Theano can exploit the GPU. After finally getting the Theano test code to execute successfully on the GPU, I took the next step and tried running a sample PyMC3 example notebook in the same environment. In particular, this notebook from the PyMC3 repo.

The notebook successfully executed until hitting cell 8 with the code

with model:
    trace = pm.sample(2000, njobs=2)

which resulted in the runtime error that follows below. I’m at a loss about how to diagnose the underlying issue. Any thoughts on how to proceed? Is there a particular example somewhere about how to configure PyMC3 and Theano to play nice with one another?

Many thanks, Chris

Auto-assigning NUTS sampler…
Initializing NUTS using advi…

RuntimeError Traceback (most recent call last)
in ()
1 with model:
----> 2 trace = pm.sample(2000, njobs=2)

/opt/anaconda/lib/python3.6/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain, njobs, tune, progressbar, model, random_seed)
147 # By default, use NUTS sampler
148 pm.log.info(‘Auto-assigning NUTS sampler…’)
--> 149 start_, step = init_nuts(init=init, n_init=n_init, model=model)
150 if start is None:
151 start = start_

/opt/anaconda/lib/python3.6/site-packages/pymc3/sampling.py in init_nuts(init, n_init, model, **kwargs)
433 if init == ‘advi’:
–> 434 v_params = pm.variational.advi(n=n_init)
435 start = pm.variational.sample_vp(v_params, 1, progressbar=False, hide_transformed=False)[0]
436 cov = np.power(model.dict_to_array(v_params.stds), 2)

/opt/anaconda/lib/python3.6/site-packages/pymc3/variational/advi.py in advi(vars, start, model, n, accurate_elbo, optimizer, learning_rate, epsilon, random_seed)
138 elbo = pm.CallableTensor(elbo)(uw_shared)
139 updates = optimizer(loss=-1 * elbo, param=[uw_shared])
–> 140 f = theano.function([], [uw_shared, elbo], updates=updates)
142 # Optimization loop

/opt/anaconda/lib/python3.6/site-packages/theano/compile/function.py in function(inputs, outputs, mode, updates, givens, no_default_updates, accept_inplace, name, rebuild_strict, allow_input_downcast, profile, on_unused_input)
324 on_unused_input=on_unused_input,
325 profile=profile,
–> 326 output_keys=output_keys)
327 # We need to add the flag check_aliased inputs if we have any mutable or
328 # borrowed used defined inputs

/opt/anaconda/lib/python3.6/site-packages/theano/compile/pfunc.py in pfunc(params, outputs, mode, updates, givens, no_default_updates, accept_inplace, name, rebuild_strict, allow_input_downcast, profile, on_unused_input, output_keys)
484 accept_inplace=accept_inplace, name=name,
485 profile=profile, on_unused_input=on_unused_input,
–> 486 output_keys=output_keys)

/opt/anaconda/lib/python3.6/site-packages/theano/compile/function_module.py in orig_function(inputs, outputs, mode, accept_inplace, name, profile, on_unused_input, output_keys)
1793 on_unused_input=on_unused_input,
1794 output_keys=output_keys).create(
-> 1795 defaults)
1797 t2 = time.time()

/opt/anaconda/lib/python3.6/site-packages/theano/compile/function_module.py in create(self, input_storage, trustme, storage_map)
1659 theano.config.traceback.limit = theano.config.traceback.compile_limit
1660 _fn, _i, _o = self.linker.make_thunk(
-> 1661 input_storage=input_storage_lists, storage_map=storage_map)
1662 finally:
1663 theano.config.traceback.limit = limit_orig

/opt/anaconda/lib/python3.6/site-packages/theano/gof/link.py in make_thunk(self, input_storage, output_storage, storage_map)
697 return self.make_all(input_storage=input_storage,
698 output_storage=output_storage,
–> 699 storage_map=storage_map)[:3]
701 def make_all(self, input_storage, output_storage):

/opt/anaconda/lib/python3.6/site-packages/theano/gof/vm.py in make_all(self, profiler, input_storage, output_storage, storage_map)
1045 compute_map,
1046 no_recycling,
-> 1047 impl=impl))
1048 linker_make_thunk_time[node] = time.time() - thunk_start
1049 if not hasattr(thunks[-1], ‘lazy’):

/opt/anaconda/lib/python3.6/site-packages/theano/gof/op.py in make_thunk(self, node, storage_map, compute_map, no_recycling, impl)
933 try:
934 return self.make_c_thunk(node, storage_map, compute_map,
–> 935 no_recycling)
936 except (NotImplementedError, utils.MethodNotDefined):
937 # We requested the c code, so don’t catch the error.

/opt/anaconda/lib/python3.6/site-packages/theano/gof/op.py in make_c_thunk(self, node, storage_map, compute_map, no_recycling)
837 _logger.debug(‘Trying CLinker.make_thunk’)
838 outputs = cl.make_thunk(input_storage=node_input_storage,
–> 839 output_storage=node_output_storage)
840 fill_storage, node_input_filters, node_output_filters = outputs

/opt/anaconda/lib/python3.6/site-packages/theano/gof/cc.py in make_thunk(self, input_storage, output_storage, storage_map, keep_lock)
1188 cthunk, in_storage, out_storage, error_storage = self.compile(
1189 input_storage, output_storage, storage_map,
-> 1190 keep_lock=keep_lock)
1192 res = _CThunk(cthunk, init_tasks, tasks, error_storage)

/opt/anaconda/lib/python3.6/site-packages/theano/gof/cc.py in compile(self, input_storage, output_storage, storage_map, keep_lock)
1129 output_storage,
1130 storage_map,
-> 1131 keep_lock=keep_lock)
1132 return (thunk,
1133 [link.Container(input, storage) for input, storage in

/opt/anaconda/lib/python3.6/site-packages/theano/gof/cc.py in cthunk_factory(self, error_storage, in_storage, out_storage, storage_map, keep_lock)
1602 ret = module.instantiate(error_storage,
-> 1603 *(in_storage + out_storage + orphd))
1605 return ret

RuntimeError: (‘The following error happened while compiling the node’, GpuElemwise{Composite{(Switch(i0, i1, i2) + (i3 * i4) + (i5 * i6 * scalar_psi(i7) * i4) + ((i8 * i6 * i4) / i9) + (i10 * i6 * scalar_psi(i11) * i4) + (i12 * i13 * i4) + ((i14 / i9) * i4) + (i15 * i16 * scalar_psi(i7) * i4) + ((i17 * i16 * i4) / i9) + (i18 * i16 * scalar_psi(i11) * i4) + (i19 * i20 * i4) + ((i21 / i9) * i4))}}[(0, 3)](GpuElemwise{Composite{Identity(GT(i0, i1))}}[].0, GpuElemwise{mul,no_inplace}.0, GpuArrayConstant{0}, GpuReshape{0}.0, GpuElemwise{exp,no_inplace}.0, GpuArrayConstant{0.5}, GpuCAReduceCuda{add}.0, GpuElemwise{mul,no_inplace}.0, GpuArrayConstant{-0.5}, GpuElemwise{add,no_inplace}.0, GpuArrayConstant{-0.5}, GpuElemwise{mul,no_inplace}.0, GpuArrayConstant{0.5}, GpuCAReduceCuda{add}.0, GpuCAReduceCuda{add}.0, GpuArrayConstant{0.5}, GpuCAReduceCuda{add}.0, GpuArrayConstant{-0.5}, GpuArrayConstant{-0.5}, GpuArrayConstant{0.5}, GpuCAReduceCuda{add}.0, GpuCAReduceCuda{add}.0), ‘\n’, ‘Could not initialize elemwise support’)

Hi Chris!

Can you try without njobs=2? I think that will cause problems on a single GPU. Also, make sure to update to recent master as @aseyboldt did some refactoring of NUTS to help with GPU acceleration.



The error "Could not initialize elemwise support" sounds like theano itself might have some trouble. Can you try a minimal theano example with the gpu to see if it works on its own? What did you specify in .theanorc?
Also, you could try to disable advi for now. If you use pymc3 master, that usually isn’t necessary anymore; init='adapt_diag' should work.
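
A minimal check along those lines could look like the sketch below. It assumes the newer gpuarray backend (which the GpuElemwise and GpuArrayConstant ops in the traceback suggest), selected with device=cuda and floatX=float32. The helper names gpu_op_names and check_theano_gpu are made up for illustration and are not part of theano.

```python
import os


def gpu_op_names(op_names):
    """Return the op class names that look like GPU ops (contain 'Gpu')."""
    return [name for name in op_names if "Gpu" in name]


def check_theano_gpu(flags="device=cuda,floatX=float32"):
    """Compile exp(x) with theano and return the GPU ops in the graph.

    Assumption: theano has not been imported yet in this process, because
    THEANO_FLAGS is only read at import time. An existing THEANO_FLAGS
    value is left untouched.
    """
    os.environ.setdefault("THEANO_FLAGS", flags)

    # Imported lazily so the flags above are in place before theano loads.
    import numpy as np
    import theano
    import theano.tensor as tt

    x = theano.shared(np.random.rand(1000).astype(theano.config.floatX))
    f = theano.function([], tt.exp(x))
    f()  # compiling and running this is what fails if the GPU setup is broken

    # Class names of all ops in the compiled graph, e.g. 'GpuElemwise'.
    op_names = [type(node.op).__name__ for node in f.maker.fgraph.toposort()]
    return gpu_op_names(op_names)
```

Calling check_theano_gpu() in a fresh kernel should return a non-empty list if the computation landed on the GPU, and an empty list if theano silently fell back to the CPU. The same two settings can also go into .theanorc under [global] as device = cuda and floatX = float32.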


Hi Thomas and Adrian,

So updating to the latest master and removing njobs=2 did the trick! It sounds like Theano will take advantage of multiple GPUs then? I’m currently running a p2.xlarge instance but can easily switch that to a p2.8xlarge.

Appreciate your quick response! Definitely looking forward to diving into PyMC3.

Have a great weekend,



I wouldn’t expect multi gpu support to work well without spending quite a bit of time profiling, and only if the problem is sufficiently large and has a structure that makes it easy to divide the logp gradient evaluations nicely. Even for the one gpu case it is not at all obvious that nuts will run faster on a gpu than on a cpu. It really depends on the model.
Documentation for using theano with multiple gpus is here. pymc3 variables are subclasses of theano vars, so you can use var.transfer on them the same way you would with ordinary theano vars. (At least as far as I know, I don’t have multiple gpus to test this). You can decide where the observations are stored by using a theano.shared with target='whatever'.
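
As a rough sketch of the setup described above, with the same caveat that it has not been run on real multi-gpu hardware: theano maps context names to devices through the contexts flag, and theano.shared(..., target=...) then places data in one of those contexts. The context names dev0 and dev1 and the helper names contexts_flag and shared_on are illustrative assumptions.

```python
def contexts_flag(n_gpus):
    """Build the THEANO_FLAGS `contexts` entry mapping devN -> cudaN."""
    return "contexts=" + ";".join(
        "dev{0}->cuda{0}".format(i) for i in range(n_gpus)
    )


def shared_on(values, target):
    """Store `values` as a float32 shared variable in the given context.

    Assumes theano was started with a matching `contexts` flag, for
    example THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1".
    """
    # Imported lazily so the flag can be set before theano loads.
    import numpy as np
    import theano

    return theano.shared(np.asarray(values, dtype="float32"), target=target)
```

With contexts_flag(2) exported in THEANO_FLAGS before starting the notebook, shared_on(data, 'dev1') would keep the observations on the second gpu, and var.transfer('dev0') would move an intermediate result to the first one. Whether that actually speeds anything up is a separate question and needs profiling.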


Thanks for the feedback Adrian. Sounds like there’s no immediate benefit then to running a p2.8xlarge instance. That’ll definitely save on cost.