Multiprocess sampling with SMC fails on Linux cluster

Hi there,

I am using the SMC sampler with PyMC3 version 3.11.2 and theano-pymc version 1.1.2 in a conda environment, on a Linux cluster with several CPUs. I tried multiprocess sampling on multiple cores with multiple SMC chains. For testing, I used the example (without gradient) from here:
https://docs.pymc.io/en/v3/pymc-examples/examples/samplers/SMC2_gaussians.html

When I run it directly from the shell on one of the workers, it runs fine. However, when I submit it with condor to run on multiple cores, I get a strange error message (pasted at the end of this post). The cluster administrator suggested it is a Theano or PyMC3 bug, so I am asking here. I already tried changing the time limit in my theanorc following this:

I also changed Theano's compile dir to /tmp/theanodir, but neither change helped. The error is not deterministic: sometimes it works, sometimes it does not even start to sample (as in the first error message pasted below), and sometimes it samples and then fails at the last step.
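For reference, here is a minimal sketch of how the compile dir could be made private to each submitted job instead of shared at a fixed path. It relies on Theano's `base_compiledir` config flag set through the `THEANO_FLAGS` environment variable; the helper name `theano_flags_with_private_compiledir` is made up for this illustration, and I have not confirmed that a private directory avoids the error.

```python
import os
import tempfile


def theano_flags_with_private_compiledir(root=None, existing=None):
    """Build a THEANO_FLAGS string that points base_compiledir at a fresh directory.

    Hypothetical helper for illustration only; it is not part of Theano or PyMC3.
    """
    # mkdtemp creates a uniquely named directory, so two jobs that start
    # at the same time on the same node do not end up sharing it.
    compiledir = tempfile.mkdtemp(prefix="theano_", dir=root or tempfile.gettempdir())

    # Keep any flags that were already set, but drop an older base_compiledir.
    flags = [
        flag
        for flag in (existing or "").split(",")
        if flag and not flag.startswith("base_compiledir=")
    ]
    flags.append(f"base_compiledir={compiledir}")
    return ",".join(flags)


# Usage (assumption: this runs at the very top of the job script,
# before theano or pymc3 is imported, because the flags are read at import time):
#
#   os.environ["THEANO_FLAGS"] = theano_flags_with_private_compiledir(
#       existing=os.environ.get("THEANO_FLAGS")
#   )
#   import pymc3 as pm
```

Note that this separates jobs from each other, not the worker processes inside one job: workers forked by the multiprocessing pool inherit the configuration of the parent process.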
Thanks in advance for any help!
Here is the error message:

Initializing SMC sampler…
Sampling 3 chains in 16 jobs
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/pymc3/smc/sample_smc.py", line 267, in sample_smc_int
    smc.setup_kernel()
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/pymc3/smc/smc.py", line 136, in setup_kernel
    print([self.model.datalogpt], self.variables, shared)
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/pymc3/model.py", line 1049, in datalogpt
    factors += [tt.sum(factor) for factor in self.potentials]
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/pymc3/model.py", line 1049, in <listcomp>
    factors += [tt.sum(factor) for factor in self.potentials]
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/tensor/basic.py", line 3221, in sum
    out = elemwise.Sum(axis=axis, dtype=dtype, acc_dtype=acc_dtype)(input)
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/graph/op.py", line 253, in __call__
    compute_test_value(node)
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/graph/op.py", line 126, in compute_test_value
    thunk = node.op.make_thunk(node, storage_map, compute_map, no_recycling=[])
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/graph/op.py", line 634, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/graph/op.py", line 600, in make_c_thunk
    outputs = cl.make_thunk(
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/link/c/basic.py", line 1203, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.compile(
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/link/c/basic.py", line 1138, in compile
    thunk, module = self.cthunk_factory(
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/link/c/basic.py", line 1634, in cthunk_factory
    module = get_module_cache().module_from_key(key=key, lnk=self)
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/link/c/cmodule.py", line 1157, in module_from_key
    module = self._get_from_hash(module_hash, key)
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/link/c/cmodule.py", line 1060, in _get_from_hash
    key_data.add_key(key, save_pkl=bool(key[0]))
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/theano/link/c/cmodule.py", line 497, in add_key
    assert key not in self.keys
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "…/fit/fit_test_pymc.py", line 432, in
    trace = pm.sample_smc(start=start, draws=kw.n_draws, n_steps=kw.n_tune_smc, chains=kw.n_chains,
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/pymc3/smc/sample_smc.py", line 196, in sample_smc
    results = pool.starmap(
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AssertionError

Sometimes the output looks like this instead:
Initializing SMC sampler…
Sampling 4 chains in 16 jobs
/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/site-packages/pymc3/sampling.py:1925: UserWarning: The effect of Potentials on other parameter>
warnings.warn(
Stage: 0 Beta: 0.000
Stage: 1 Beta: 0.009
Stage: 2 Beta: 0.027
Stage: 3 Beta: 0.062
Stage: 4 Beta: 0.126
Stage: 5 Beta: 0.254
Stage: 6 Beta: 0.537
Stage: 7 Beta: 0.977
Stage: 8 Beta: 1.000
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/net/scratch/auger/conda/envs/teresa_py38_pymc311/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))

So in this case the sampler finished, and then the same error appeared at the end.