Issues parallelizing pymc3 model with the `multiprocessing` library

jstanley · June 22, 2021, 9:44pm

Pardon if this is an underspecified question, but I am attempting to parallelize the fitting of a model using Python's multiprocessing library. Are there any known issues with using this library to run pymc3 models in parallel? I am running into a problem that I can’t figure out.

Context

I have 1k+ datasets that I’m attempting to fit with the same model. The prior selection, model building, fitting, and results summarizing all live in a single function, fit_routine. I then parallelize the fitting with the following lines:

import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())
res = pool.starmap(fit_routine, [(i, config, pad_dict) for i in mpargs.items()])
pool.close()
pool.join()

Here config and pad_dict are two static objects that help specify the priors for each of the thousand fitting instances. The mpargs dictionary contains the info that varies from one fit instance to the next (namely, different datasets, as well as some identifying information used to organize the results).
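The setup described above can be sketched roughly as follows. This is only an illustration, not the actual script: build_tasks and run_parallel are hypothetical helper names, and fit_routine is assumed to accept one (key, value) item from mpargs plus the two static objects.

```python
import multiprocessing as mp


def build_tasks(mpargs, config, pad_dict):
    """Return one argument tuple per dataset: ((key, value), config, pad_dict)."""
    return [(item, config, pad_dict) for item in mpargs.items()]


def run_parallel(fit_routine, mpargs, config, pad_dict, n_workers=None):
    """Fit every dataset in a worker pool and key the results by dataset id."""
    tasks = build_tasks(mpargs, config, pad_dict)
    # The context manager shuts the pool down even if a worker raises.
    with mp.Pool(processes=n_workers or mp.cpu_count()) as pool:
        # starmap blocks until all tasks finish and preserves input order.
        results = pool.starmap(fit_routine, tasks)
    return dict(zip(mpargs.keys(), results))
```

Keying the results by the mpargs keys makes it easier to tell which dataset a failed or odd-looking fit belongs to.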

I am 95% sure that fit_routine works properly on all of the input datasets, because I ran the whole routine serially on several hundred of them before attempting to do so in parallel, and every run returned sensible results with no errors.

Error

However, when running in parallel with the above lines of code, I get the following error:

ERROR (theano.graph.opt): Optimization failure due to: constant_folding
ERROR (theano.graph.opt): node: InplaceDimShuffle{}(TensorConstant{(1,) of -1.0})
ERROR (theano.graph.opt): TRACEBACK:
ERROR (theano.graph.opt): Traceback (most recent call last):
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/opt.py", line 2017, in process_node
    replacements = lopt.transform(fgraph, node)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/opt.py", line 1209, in transform
    return self.fn(*args, **kwargs)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/tensor/opt.py", line 7006, in constant_folding
    thunk = node.op.make_thunk(
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/op.py", line 634, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/graph/op.py", line 600, in make_c_thunk
    outputs = cl.make_thunk(
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/basic.py", line 1203, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/basic.py", line 1138, in __compile__
    thunk, module = self.cthunk_factory(
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/basic.py", line 1634, in cthunk_factory
    module = get_module_cache().module_from_key(key=key, lnk=self)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 1157, in module_from_key
    module = self._get_from_hash(module_hash, key)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 1069, in _get_from_hash
    self.check_key(key, key_data.key_pkl)
  File "/Users/jast/miniconda3/envs/liet/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 1241, in check_key
    key_data = pickle.load(f)
_pickle.UnpicklingError: invalid load key, '\x00'.

However, the script continues to run and will fit the first ~20 datasets before it stops being able to build the pymc3 model (the step before fitting) for the remaining 900+ datasets. I am certain it’s not a problem with the data because it doesn’t break on the same dataset every time, and because I can fit those datasets with the fit_routine function individually without issue.

Additional info

Some other details: I am using the ADVI method for the fitting process, not MCMC. I am running this on a server with the Slurm job manager. I don’t believe this to be a memory issue because the Slurm job summary reports a max memory usage of about 6GB and there’s approximately 100GB available for the job.

The last line of the error makes me think it’s a pickling issue, but I don’t have the foggiest idea why that would be the case. Any help would be greatly appreciated, thanks!
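One way to read that last line: the final frames of the traceback are in Theano's module cache (check_key calling pickle.load on a key file), not in multiprocessing's own pickling of the arguments. A minimal standard-library sketch of how loading bytes that start with a zero byte raises this kind of error follows. The helper name try_load is made up for illustration, and the idea that the cache file was truncated or still being written by another process is an assumption, not something the traceback proves.

```python
import pickle


def try_load(raw):
    """Unpickle raw bytes, returning the object or a short error description."""
    try:
        return pickle.loads(raw)
    except pickle.UnpicklingError as exc:
        return f"UnpicklingError: {exc}"


# A well-formed pickle round-trips without trouble.
good = pickle.dumps({"key": 1})
print(try_load(good))

# Bytes beginning with 0x00 (assumed here to mimic a damaged or
# half-written cache file) are rejected at the first opcode.
print(try_load(b"\x00" * 8))
```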

Versions:

print(sys.version, pm.__version__, pm.theano.__version__)

3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0] 3.11.2 1.1.2

jstanley · June 22, 2021, 11:08pm

As an addendum to the above, in lieu of the multiprocessing method, are there any recommended approaches to fitting pymc3 models in parallel on an N-cpu system?

jstanley · July 9, 2021, 7:18pm

For those curious, it turned out to be an issue with the Theano compiler queue. Changing the time limit in .theanorc, as suggested in the issue linked below, solved the problem:

github.com/pymc-devs/pymc3: "Running multiple instances of Pymc3 scripts simultaneously causes error!" (opened by parashardhapola on 19 Oct 2016, 02:10 UTC; closed 22 Dec 2018, 12:10 UTC)

Hi, Please see the error log below. ``` INFO (theano.gof.compilelock): Waitin…g for existing lock by unknown process (I am process '16768') INFO (theano.gof.compilelock): To manually release the lock, delete /home/parashar/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.7-Santiago-x86_64-3.5.2-64/lock_dir INFO (theano.gof.compilelock): Waiting for existing lock by process '31361' (I am process '16768') INFO (theano.gof.compilelock): To manually release the lock, delete /home/parashar/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.7-Santiago-x86_64-3.5.2-64/lock_dir INFO (theano.gof.compilelock): Waiting for existing lock by process '31931' (I am process '16768') INFO (theano.gof.compilelock): To manually release the lock, delete /home/parashar/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.7-Santiago-x86_64-3.5.2-64/lock_dir INFO (theano.gof.compilelock): Waiting for existing lock by process '81671' (I am process '16768') INFO (theano.gof.compilelock): To manually release the lock, delete /home/parashar/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.7-Santiago-x86_64-3.5.2-64/lock_dir INFO (theano.gof.compilelock): Waiting for existing lock by process '81592' (I am process '16768') INFO (theano.gof.compilelock): To manually release the lock, delete /home/parashar/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.7-Santiago-x86_64-3.5.2-64/lock_dir Traceback (most recent call last): File "G4_Seq_QG_overlap_switchpoint.py", line 83, in <module> traces.append(get_switchpoint(read_level_data[l.name][1].copy())) File "G4_Seq_QG_overlap_switchpoint.py", line 43, in get_switchpoint step = pm.Metropolis([early_rate, late_rate, switchpoint]) File "/home/parashar/anaconda3/lib/python3.5/site-packages/pymc3/step_methods/arraystep.py", line 60, in __new__ step.__init__([var], *args, **kwargs) File "/home/parashar/anaconda3/lib/python3.5/site-packages/pymc3/step_methods/metropolis.py", line 110, in __init__ 
self.delta_logp = delta_logp(model.logpt, vars, shared) File "/home/parashar/anaconda3/lib/python3.5/site-packages/pymc3/step_methods/metropolis.py", line 310, in delta_logp f = theano.function([inarray1, inarray0], logp1 - logp0) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/compile/function.py", line 320, in function output_keys=output_keys) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/compile/pfunc.py", line 479, in pfunc output_keys=output_keys) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 1777, in orig_function defaults) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 1641, in create input_storage=input_storage_lists, storage_map=storage_map) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/link.py", line 690, in make_thunk storage_map=storage_map)[:3] File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/vm.py", line 1003, in make_all no_recycling)) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/op.py", line 970, in make_thunk no_recycling) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/op.py", line 879, in make_c_thunk output_storage=node_output_storage) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cc.py", line 1200, in make_thunk keep_lock=keep_lock) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cc.py", line 1143, in __compile__ keep_lock=keep_lock) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cc.py", line 1595, in cthunk_factory key=key, lnk=self, keep_lock=keep_lock) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cmodule.py", line 1108, in module_from_key module = self._get_from_hash(module_hash, key, keep_lock=keep_lock) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cmodule.py", line 1008, in _get_from_hash 
key_data.add_key(key, save_pkl=bool(key[0])) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cmodule.py", line 483, in add_key self.save_pkl() File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cmodule.py", line 504, in save_pkl with open(self.key_pkl, 'wb') as f: FileNotFoundError: [Errno 2] No such file or directory: '/home/parashar/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.7-Santiago-x86_64-3.5.2-64/tmpi22zbwyw/key.pkl' WARNING (theano.gof.cmodule): Removing key file /home/parashar/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.7-Santiago-x86_64-3.5.2-64/tmpbnxffomo/key.pkl because the corresponding module is gone from the file system. WARNING (theano.gof.cmodule): A module that was loaded by this ModuleCache can no longer be read from file /home/parashar/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.7-Santiago-x86_64-3.5.2-64/tmpi22zbwyw/m5080109b7465bc969faf7603bf21e896.so... this could lead to problems. 
Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cmodule.py", line 1466, in _on_atexit self.clear_old() File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cmodule.py", line 1277, in clear_old cleanup=False) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cmodule.py", line 945, in refresh key_data.delete_keys_from(self.entry_from_key) File "/home/parashar/anaconda3/lib/python3.5/site-packages/theano/gof/cmodule.py", line 535, in delete_keys_from del entry_from_key[key] KeyError: (((12, (3, (3, (4,), (4,)), (4,), (4,), (4,), (4,), (4,), (4,), (4,), (4,), (4,), (4,)), (13, '1.11.1'), (13, '1.11.1'), (13, '1.11.1'), (13, '1.11.1'), (13, '1.11.1'), (13, '1.11.1'), (13, '1.11.1'), (13, '1.11.1'), (13, '1.11.1'), (13, '1.11.1'), ('openmp', False)), (11, 13, '1.11.1'), (11, 13, '1.11.1'), (11, 13, '1.11.1'), (11, 13, '1.11.1'), (11, 13, '1.11.1'), (11, 13, '1.11.1'), (11, 13, '1.11.1'), (11, 13, '1.11.1'), (11, 13, '1.11.1'), (11, 13, '1.11.1')), ('CLinker.cmodule_key', ('--param', '--param', '--param', '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION', '-O3', '-Wno-unused-label', '-Wno-unused-variable', '-Wno-write-strings', '-fPIC', '-fno-math-errno', '-m64', '-maes', '-march=core2', '-mavx', '-mcx16', '-mpclmul', '-mpopcnt', '-msahf', '-mtune=generic', 'l1-cache-line-size=64', 'l1-cache-size=32', 'l2-cache-size=20480'), (), (), 'NPY_ABI_VERSION=0x1000009', 'c_compiler_str=/usr/bin/g++ 4.4.7', 'md5:mcf7a57c9bb90efec81411a07c50224ca', (<theano.tensor.elemwise.Elemwise object at 0x2b2dabf61f60>, ((TensorType(int64, (True,)), ((-1, 0), False)), (TensorType(int64, vector), (('m983fd54d4e98388c1368723aa0e707a7', 0, 1), False)), (TensorType(float64, (True,)), ((-1, 2), False)), (TensorType(float64, (True,)), ((-1, 3), False)), (TensorType(int8, (True,)), (('m64cd427e9256099cfe5adc179cd8bf82', 0, 4), False)), (TensorType(int8, vector), 
(('m9f7ab43a62804b796c006032065d30e1', 0, 5), False)), (TensorType(float32, (True,)), (('m04dd9722ae525447134fc924204422a5', 0, 6), False)), (TensorType(float64, vector), (('mb71ebc0435c9bedc241effb12d3fcca8', 0, 7), False)), (TensorType(float64, vector), (('mc8711b416fa30c1b1eaa5fea9e6da412', 0, 8), False))), (1, (False,)))))
```

The function containing pymc3 code I'm using in my script:

```
def get_switchpoint(data):
    data[data < 0] = 0
    bases = np.array(list(range(len(data))))
    with pm.Model(verbose=False) as phredscore_model:
        switchpoint = pm.DiscreteUniform('switchpoint', lower=bases.min(), upper=bases.max())
        early_rate = pm.Exponential('early_rate', 1)
        late_rate = pm.Exponential('late_rate', 1)
        rate = pm.math.switch(switchpoint >= bases, early_rate, late_rate)
        phredscore = pm.Poisson('phredscore', rate, observed=data)
        step = pm.Metropolis([early_rate, late_rate, switchpoint])
        trace = pm.sample(4000, step=[step], progressbar=False)
    return np.array([
        trace['switchpoint'][-1000:],
        trace['early_rate'][-1000:],
        trace['late_rate'][-1000:]
    ])
```

I'm trying to run the script in an HPC environment and run it multiple times (approx 10K) using a wrapper script supplying arguments to it. Hence #1174 won't apply to my case. Each run instance of the script iterates the pymc function thousands of times. As a quick fix, can somebody also show me how to transpile this code to pymc2?

Thank you

Best regards, Parashar
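A related workaround, different from the time-limit change reported above, is to give each worker process its own Theano compile directory so the workers don't contend for one lock and one cache. The sketch below is a minimal version under stated assumptions: worker_theano_flags and init_worker are hypothetical names, base_compiledir is a real Theano config option, and the environment variable only takes effect if theano/pymc3 is first imported inside the worker after it has been set (for example, imported inside fit_routine rather than at the top of the parent script).

```python
import os
import tempfile


def worker_theano_flags(worker_id, root=None, extra_flags=None):
    """Build a THEANO_FLAGS string that points a worker at its own compile dir."""
    root = root or tempfile.gettempdir()
    compiledir = os.path.join(root, f"theano_compiledir_{worker_id}")
    flags = {"base_compiledir": compiledir}
    # extra_flags is a pass-through for any other options, such as the
    # compile-lock time limit; option names depend on the Theano version.
    flags.update(extra_flags or {})
    return ",".join(f"{key}={value}" for key, value in flags.items())


def init_worker(root=None, extra_flags=None):
    """Pool initializer: call before theano or pymc3 is imported in the worker."""
    os.environ["THEANO_FLAGS"] = worker_theano_flags(os.getpid(), root, extra_flags)
```

It could be wired in with mp.Pool(n, initializer=init_worker). The time-limit option could be passed through extra_flags; classic Theano documents it as compile.timeout, and the Theano-PyMC 1.1 spelling is believed to be compile__timeout, but that is worth checking against theano.config for the installed version. Separate compile directories also mean each worker compiles its own copy of the model, which costs some extra start-up time.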

twiecki · July 12, 2021, 10:28am

Thanks for reporting the solution @jstanley. I opened an issue to increase the default: Increase default compile lock time-out · Issue #521 · aesara-devs/aesara · GitHub.

jstanley · July 12, 2021, 7:35pm

Great, thanks @twiecki.
