Pardon me if this is an underspecified question, but I am attempting to parallelize the fitting of a model using the Python multiprocessing library. Are there any known issues with using this library to run pymc3 models in parallel? I am running into an error doing so that I can't figure out.
Context
I have 1k+ datasets that I'm attempting to fit with the same model. All of the prior selection, model building, fitting, and results summarizing is handled in a single function, fit_routine. I then parallelize the fitting with the following lines:
import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())
res = pool.starmap(fit_routine, [(i, config, pad_dict) for i in mpargs.items()])
pool.close()
Here config and pad_dict are two static objects that help specify the priors for each of the thousand fitting instances. The mpargs dictionary contains the information that varies from one fit instance to the next (namely, the different datasets, as well as some identifying information used to organize the results).
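For reference, the per-task call implied by that starmap line has roughly this shape. The body below is a hypothetical stand-in (the variable names, priors, and fit settings are made up); only the signature and the ADVI-style fit mentioned under "Additional info" below reflect the real fit_routine:

import pymc3 as pm

def fit_routine(item, config, pad_dict):
    # `item` is one (key, value) pair from mpargs.items(): the key identifies
    # the dataset and the value carries the data for that fit instance.
    key, data = item

    # Hypothetical model; the real priors are built from `config` and `pad_dict`.
    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=10.0)
        pm.Normal("obs", mu=mu, sigma=sigma, observed=data)

        # ADVI fit, then draw samples from the approximation for summarizing.
        approx = pm.fit(n=20000, method="advi")
        posterior = approx.sample(1000)

    return key, posterior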
I am 95% sure the fit_routine function works properly on all of the input datasets, because I ran this whole routine serially on several hundred of the datasets before attempting to do so in parallel, and all instances returned sensible results with no errors.
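That serial check was just a plain loop over a subset of the same argument tuples, along these lines (illustrative only; the subset size is arbitrary):

# Serial sanity check: same arguments as the starmap call, no multiprocessing.
subset = list(mpargs.items())[:300]
serial_res = [fit_routine(i, config, pad_dict) for i in subset]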
Error
However, when running in parallel, with the above lines of code I get the following error:
ERROR (theano.graph.opt): Optimization failure due to: constant_folding
ERROR (theano.graph.opt): node: InplaceDimShuffle{}(TensorConstant{(1,) of -1.0})
ERROR (theano.graph.opt): TRACEBACK:
ERROR (theano.graph.opt): Traceback (most recent call last):
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/opt.py", line 2017, in process_node
replacements = lopt.transform(fgraph, node)
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/opt.py", line 1209, in transform
return self.fn(*args, **kwargs)
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/tensor/opt.py", line 7006, in constant_folding
thunk = node.op.make_thunk(
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/op.py", line 634, in make_thunk
return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/op.py", line 600, in make_c_thunk
outputs = cl.make_thunk(
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/basic.py", line 1203, in make_thunk
cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/basic.py", line 1138, in __compile__
thunk, module = self.cthunk_factory(
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/basic.py", line 1634, in cthunk_factory
module = get_module_cache().module_from_key(key=key, lnk=self)
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 1157, in module_from_key
module = self._get_from_hash(module_hash, key)
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 1069, in _get_from_hash
self.check_key(key, key_data.key_pkl)
File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 1241, in check_key
key_data = pickle.load(f)
_pickle.UnpicklingError: invalid load key, '\x00'.
However, the script continues to run and fits the first ~20 datasets before it stops being able to build the pymc3 model (the step before fitting) for the remaining 900+ datasets. I am certain it's not a problem with the data, because it doesn't break on the same dataset every time, and because I can fit those datasets individually with the fit_routine function without issue.
Additional info
Some other details: I am using the ADVI method for the fitting process, not MCMC. I am running this on a server with the Slurm job manager. I don't believe this is a memory issue, because the Slurm job summary reports a maximum memory usage of about 6 GB and there is approximately 100 GB available for the job.
The last line of the error makes me think it's a pickling issue, but I don't have the foggiest idea why that would be the case. Any help would be greatly appreciated, thanks!
Versions:
print(sys.version, pm.__version__, pm.theano.__version__)
3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0] 3.11.2 1.1.2