Hi all,
I’ve been really struggling with this error when trying to run my model on a compute cluster. I never seem to hit it when running a single job at a time, but with more than four or five jobs in parallel, at some point (the timing varies quite a lot) all the jobs running on the same node crash with the following error:
ERROR (theano.graph.opt): Optimization failure due to: constant_folding
ERROR (theano.graph.opt): node: Elemwise{clip,no_inplace}(TensorConstant{[[1 0 2 2 .. 2 2 2 2]]}, TensorConstant{(1, 1) of 0}, TensorConstant{(1, 1) of 3})
ERROR (theano.graph.opt): TRACEBACK:
ERROR (theano.graph.opt): Traceback (most recent call last):
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/graph/opt.py", line 2017, in process_node
replacements = lopt.transform(fgraph, node)
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/graph/opt.py", line 1209, in transform
return self.fn(*args, **kwargs)
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/tensor/opt.py", line 7006, in constant_folding
thunk = node.op.make_thunk(
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/graph/op.py", line 634, in make_thunk
return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/graph/op.py", line 600, in make_c_thunk
outputs = cl.make_thunk(
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/link/c/basic.py", line 1203, in make_thunk
cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/link/c/basic.py", line 1138, in __compile__
thunk, module = self.cthunk_factory(
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/link/c/basic.py", line 1634, in cthunk_factory
module = get_module_cache().module_from_key(key=key, lnk=self)
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 1160, in module_from_key
module = self._get_from_hash(module_hash, key)
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 1072, in _get_from_hash
self.check_key(key, key_data.key_pkl)
File "/exports/eddie/scratch/s1983893/pymc_pop/lib/python3.9/site-packages/theano/link/c/cmodule.py", line 1269, in check_key
raise AssertionError(
AssertionError: Key not found in unpickled KeyData file. Verify the __eq__ and __hash__ functions of your Ops. The file is: /exports/eddie/scratch/vnedelcu/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-glibc2.17-x86_64-3.9.12-64/tmp00h3swtc/key.pkl. The key is: (((13, (4,), (13, '1.22.3'), (13, '1.22.3'), (13, '1.22.3'), (13, '1.22.3'), ('openmp', False)), ('scalar_op', 'inplace_pattern'), (11, 13, '1.22.3'), (11, 13, '1.22.3'), (11, 13, '1.22.3'), (11, 13, '1.22.3')), ('CLinker.cmodule_key', ('--param', '--param', '--param', '-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION', '-O3', '-Wno-c++11-narrowing', '-Wno-unused-label', '-Wno-unused-variable', '-Wno-write-strings', '-fPIC', '-fno-math-errno', '-m64', '-mabm', '-maes', '-march=haswell', '-mavx', '-mavx2', '-mbmi', '-mbmi2', '-mcx16', '-mf16c', '-mfma', '-mfsgsbase', '-mfxsr', '-mlzcnt', '-mmmx', '-mmovbe', '-mno-3dnow', '-mno-adx', '-mno-avx5124fmaps', '-mno-avx5124vnniw', '-mno-avx512bitalg', '-mno-avx512bw', '-mno-avx512cd', '-mno-avx512dq', '-mno-avx512er', '-mno-avx512f', '-mno-avx512ifma', '-mno-avx512pf', '-mno-avx512vbmi', '-mno-avx512vbmi2', '-mno-avx512vl', '-mno-avx512vnni', '-mno-avx512vpopcntdq', '-mno-cldemote', '-mno-clflushopt', '-mno-clwb', '-mno-clzero', '-mno-fma4', '-mno-gfni', '-mno-hle', '-mno-lwp', '-mno-movdir64b', '-mno-movdiri', '-mno-mwaitx', '-mno-pconfig', '-mno-pku', '-mno-prefetchwt1', '-mno-prfchw', '-mno-ptwrite', '-mno-rdpid', '-mno-rdseed', '-mno-rtm', '-mno-sgx', '-mno-sha', '-mno-shstk', '-mno-sse4a', '-mno-tbm', '-mno-vaes', '-mno-vpclmulqdq', '-mno-waitpkg', '-mno-wbnoinvd', '-mno-xop', '-mno-xsavec', '-mno-xsaves', '-mpclmul', '-mpopcnt', '-mrdrnd', '-msahf', '-msse', '-msse2', '-msse3', '-msse4.1', '-msse4.2', '-mssse3', '-mtune=haswell', '-mxsave', '-mxsaveopt', 'l1-cache-line-size=64', 'l1-cache-size=32', 'l2-cache-size=20480'), (), (), 'NPY_ABI_VERSION=0x1000009', 'c_compiler_str=/exports/eddie/scratch/vnedelcu/pymc_pop/bin/g++ 9.4.0', 
'md5:m516ece8d141053125b5cf97e4665b958caebce5148f8a4c3dad250b731bdfb42', (<theano.tensor.elemwise.Elemwise object at 0x2b1e78a31220>, ((TensorType(int64, row), (('maf5f18cafb0241757d43109ca5231ceab844f22a68c1ecf904d131f2aec4ae0c', 0, 0), False)), (TensorType(int8, (True, True)), (('ma764b8503f352b173af5dec70989435bd0be2fb551d548bedf3667f24a4f653e', 0, 1), False)), (TensorType(int64, (True, True)), (('me04dcd1614b9c9c19023663a69b1f48c71f8059b5d17bd029bf72e398c8a0118', 0, 2), False))), (1, (False,)))))
I’ve tried increasing the compile lock timeout to 10000, thinking the lock might be involved (see the thread “Issues parallelizing pymc3 model with the `multiprocessing` library”), but that didn’t help. Even if I remove all parallelization in my sampling (i.e., I use 1 chain + 1 job), the error still occurs. I am using pymc3 v3.11.4 and theano-pymc v1.1.2.
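For context, here is a minimal sketch of how settings like these can be passed to Theano per job, on the assumption that the jobs on a node share the compile directory named in the error message. The helper `theano_flags_for_job`, the `JOB_ID` environment variable, and the use of the system temp directory are all hypothetical, and the flag names `base_compiledir` and `compile__timeout` are assumptions that should be checked against the installed Theano version. It is not a confirmed fix.

```python
import os
import tempfile


def theano_flags_for_job(job_id, scratch_root, timeout_s=10000):
    """Build a THEANO_FLAGS string with a per-job cache directory (hypothetical helper)."""
    compiledir = os.path.join(scratch_root, f"theano_job_{job_id}")
    os.makedirs(compiledir, exist_ok=True)
    # Flag names are assumptions; verify them against the installed Theano version.
    return f"base_compiledir={compiledir},compile__timeout={timeout_s}"


# JOB_ID is a hypothetical scheduler variable; fall back to the process ID.
job_id = os.environ.get("JOB_ID", str(os.getpid()))
flags = theano_flags_for_job(job_id, tempfile.gettempdir())

# Set this before `import theano` or `import pymc3`: the flags are read at import time.
os.environ["THEANO_FLAGS"] = flags
```

With a setup like this, each job would compile into its own directory instead of sharing one cache, at the cost of recompiling everything for every job.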
I would really appreciate any help, as I have no clue what could be causing this, or even if there’s any way to solve it!