Issues parallelizing pymc3 model with the `multiprocessing` library

Pardon if this is an underspecified question, but I am attempting to parallelize the fitting of a model using Python's multiprocessing library. Are there any known issues with using this library to run pymc3 models in parallel? I am running into an issue I can't figure out.


I have 1k+ datasets I'm attempting to fit with the same model. The prior selection, model building, fitting, and results summarizing are all contained in a single function, fit_routine. I then parallelize the fitting with the following lines:

import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())
res = pool.starmap(fit_routine, [(i, config, pad_dict) for i in mpargs.items()])

Here config and pad_dict are two static objects that help specify the priors for each of the thousand fitting instances. The mpargs dictionary contains the info that varies from one fit instance to the next (namely, different datasets, as well as some identifying information used to organize the results).

I am 95% sure fit_routine works properly on all of the input datasets, because I ran the whole routine serially on several hundred of them before attempting to run it in parallel, and every instance returned sensible results with no errors.


However, when running in parallel, with the above lines of code I get the following error:

ERROR (theano.graph.opt): Optimization failure due to: constant_folding
ERROR (theano.graph.opt): node: InplaceDimShuffle{}(TensorConstant{(1,) of -1.0})
ERROR (theano.graph.opt): TRACEBACK:
ERROR (theano.graph.opt): Traceback (most recent call last):
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/", line 2017, in process_node
    replacements = lopt.transform(fgraph, node)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/", line 1209, in transform
    return self.fn(*args, **kwargs)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/tensor/", line 7006, in constant_folding
    thunk = node.op.make_thunk(
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/", line 634, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/", line 600, in make_c_thunk
    outputs = cl.make_thunk(
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/", line 1203, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/", line 1138, in __compile__
    thunk, module = self.cthunk_factory(
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/", line 1634, in cthunk_factory
    module = get_module_cache().module_from_key(key=key, lnk=self)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/", line 1157, in module_from_key
    module = self._get_from_hash(module_hash, key)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/", line 1069, in _get_from_hash
    self.check_key(key, key_data.key_pkl)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/", line 1241, in check_key
    key_data = pickle.load(f)
_pickle.UnpicklingError: invalid load key, '\x00'.

However, the script continues to run and fits the first ~20 datasets before it stops being able to build the pymc3 model (the step before fitting) for the remaining 900+ datasets. I am certain it's not a problem with the data, because it doesn't break on the same dataset every time, and because I can fit those datasets individually with fit_routine without issue.

Additional info

Some other details: I am using the ADVI method for the fitting process, not MCMC, and I am running this on a server managed by the Slurm workload manager. I don't believe this is a memory issue, because the Slurm job summary reports a max memory usage of about 6 GB, while approximately 100 GB is available to the job.

The last line of the error makes me think it's a pickling issue, but I haven't the foggiest idea why that would be the case. Any help would be greatly appreciated, thanks!


print(sys.version, pm.__version__, pm.theano.__version__)

3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0] 3.11.2 1.1.2

As an addendum to the above, in lieu of the multiprocessing method, are there any recommended approaches to fitting pymc3 models in parallel on an N-cpu system?

For those curious, it turned out to be an issue with the theano compile lock. Raising the time limit in .theanorc, as suggested in the link below, solved the problem:


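For reference, a sketch of the .theanorc change, assuming the compile.timeout and compile.wait option names from theano 1.1.x (dotted config options map to INI sections; verify the names against your installed version's config docs):

```ini
# ~/.theanorc -- raise the compile-lock time-out so worker processes
# waiting on the shared compilation cache don't give up early.
# Values are in seconds.
[compile]
wait = 10
timeout = 600
```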
Thanks for reporting the solution @jstanley. I opened an issue to increase the default: Increase default compile lock time-out · Issue #521 · aesara-devs/aesara · GitHub.

Great, thanks @twiecki .