Using pymc3 and theano on cluster with Slurm: filelock._error.Timeout

Hey community :slight_smile:
I am trying to run a Bayesian inference script on a high-performance cluster running SUSE Linux Enterprise Server 15 SP2. Running the script from the command line with python3 script.py works fine, but submitting it through a Slurm bash script throws the following error:

filelock._error.Timeout: The file lock '.1662466228.theano/compiledir_Linux-5.3-46_6.0.29-cray_ari_c-x86_64-with-glibc2.26-x86_64-3.9.12-64/.lock' could not be acquired.

I already included the following changes to the theano.config in the bash script:

theano-cache purge
stamp=$(date +%s)
export THEANO_FLAGS="base_compiledir=.$stamp.theano/,compile__timeout=24,compile__wait=20"

These are solutions suggested in Running multiple instances of Pymc3 scripts simultaneously causes error! · Issue #1463 · pymc-devs/pymc · GitHub.

I use pymc3 version 3.11.2 and theano-pymc version 1.1.2.

Thank you :slight_smile:

Emilius


This doesn’t seem like a timeout issue to me, but I’m not confident in that. I think @michaelosthege has some experience running on clusters. Not sure if this has cropped up.

@cluhmann Thank you for your response.
I was able to figure it out. The problem was that Slurm was somehow not able to access the compile directory which was in my personal home directory. Changing it to e.g. base_compiledir=/var/tmp/.theano solved the issue.
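
For anyone hitting the same error, here is a rough sketch of how the relevant part of the Slurm batch script could look. It assumes /var/tmp is writable on the compute nodes (this was the case on my cluster, but it may differ on yours). The job name and the variable names scratch_root, job_tag and compiledir are made up for illustration. SLURM_JOB_ID is set by Slurm inside a job; the timestamp fallback is only there so the snippet also runs outside of Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=pymc3-inference   # hypothetical job name

# Assumption: /var/tmp exists and is writable on the compute node.
scratch_root="/var/tmp"

# One compile directory per job, so concurrent jobs do not fight over one lock.
job_tag="${SLURM_JOB_ID:-$(date +%s)}"
compiledir="${scratch_root}/.theano-${job_tag}"
mkdir -p "$compiledir"

# Fail early with a readable message instead of a filelock timeout later on.
if [ ! -w "$compiledir" ]; then
    echo "Compile directory $compiledir is not writable from this node" >&2
    exit 1
fi

export THEANO_FLAGS="base_compiledir=${compiledir}"

# The actual run follows here, e.g.:
# python3 script.py
```

The writability check is the important part: in my case the lock could not be acquired because the job could not access the directory at all, not because another process was holding the lock.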


@erichter since theano-pymc 1.1.5 there have been several improvements to the way compiledirs work in Aesara.
I would highly recommend updating to the latest PyMC release: it is much more convenient to work with, and we no longer support versions below 4.0.0.