Issues with PyMC Execution using Snakemake: PyTensor Errors

Hello, PyMC community!

I am encountering problems when running a PyMC model as part of a Snakemake pipeline, particularly under high parallelism in a UGE (Univa Grid Engine) environment. The problem manifests as missing output files caused by PyTensor errors (ModuleNotFoundError and AssertionError).

The pipeline consists of two steps:

  • generating the data with certain parameters (here I use PyMC to simulate data, draws = 1, cores = 1)
  • checking the model with simulated data (here I use PyMC to fit the model to the data, default parameters for pm.sample, cores = 1).
    The problem arises mainly during the second step and only rarely during the first.

Environment:

  • PyMC version: 5.10.3
  • Python version: 3.10.11
  • Operating system: Rocky Linux
  • Environment: Snakemake with a UGE profile, running up to 500 parallel jobs in a mamba environment

Details:

  1. Intermittent missing output. Although the Snakemake logs report successful job completion, output files are missing. The errors do not occur consistently with particular parameter sets, so the parameters themselves are unlikely to be the root cause.

  2. UGE profile. The pipeline is executed with a UGE profile that allows 500 parallel jobs. Running the pipeline with a single job (--jobs=1) or executing it locally does not reproduce these errors.

  3. Errors in job logs. When I look at the log file of a failed job, I see that the job did not actually finish because of a PyTensor error.

  4. PyTensor errors. There are two main types of errors in the jobs' log files:

  • ModuleNotFoundError (a temporary module created by PyTensor is missing);
  • AssertionError (a problem with key verification in the PyTensor compilation directory).

What I tried

  • Increased the disk space allocated to the environment to rule out space restrictions. This did not resolve the problem, so a lack of space is probably not the cause.
  • Set cores=1 for the pm.sample step. This did not help either.

Has anyone encountered similar issues with PyMC or PyTensor within a Snakemake pipeline, and how were they resolved?

Thank you!

Error examples
ModuleNotFound.txt (12.2 KB)
Assertion_error.txt (18.8 KB)

CC @lucianopaz


I think this same problem happened to me on a different cluster. The issue I had to deal with was that all parallel processes were storing their compiled modules in the exact same folder: .pytensor/ in the home directory. Just to give some background, when you use the C backend, the computational graph gets optimized, then it's transpiled into C, and finally it gets compiled into a shared object library with some Python hooks to make it importable. When a PyTensor process finishes, it sometimes cleans the compilation directory. Since you are running multiple jobs in parallel, you can run into concurrency issues when one process cleans the compilation directory before another process has had the chance to import its compiled module.

My solution was to set a different compilation directory per process using an environment variable. The implementation will be specific to your shell and your cluster, so you'll have to figure that part out. What I did was write a very small bash script that sets one environment variable and then calls the Python script that does all the work:

export PYTENSOR_FLAGS="compiledir=$HOME/.pytensor/compiledir_$(uuidgen)"
./scripts/simulation_study.py "$@"

You would have to call this bash script instead of the python script as the entry point in your cluster job setup, but after that, you shouldn’t run into these problems.
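If a wrapper script is awkward to plug into the job setup, the same idea can be sketched from inside Python. This is only a sketch under a few assumptions: the helper name use_private_compiledir is made up for illustration, it relies on PyTensor reading PYTENSOR_FLAGS when it is first imported (so it has to run before any import of pytensor or pymc), and, like the bash version above, it leaves it to PyTensor to create the directory.

```python
import os
import uuid
from pathlib import Path


def use_private_compiledir(base=None):
    """Point PyTensor at a fresh compile directory for this process.

    Hypothetical helper: call it before the first `import pytensor` or
    `import pymc`, since the flags are read at import time.
    """
    base = Path(base) if base is not None else Path.home() / ".pytensor"
    # A random suffix keeps concurrent jobs out of each other's directory
    compiledir = base / f"compiledir_{uuid.uuid4().hex}"

    # Keep any other flags already set, but replace an existing compiledir
    existing = os.environ.get("PYTENSOR_FLAGS", "")
    parts = [p for p in existing.split(",") if p and not p.startswith("compiledir=")]
    parts.append(f"compiledir={compiledir}")
    os.environ["PYTENSOR_FLAGS"] = ",".join(parts)
    return compiledir
```

In the worker script this would be called at the very top, followed by the usual `import pymc as pm`. Note that every run then leaves its own directory behind under the base folder, so some periodic cleanup is probably needed.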

Thank you, it worked!
I added it to the shell command in my Snakefile, like this:

export PYTENSOR_FLAGS="compiledir=$HOME/.pytensor/compiledir_{params.ident}"
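For this to work, params.ident has to be different for every job that can run at the same time. How it is defined is not shown in the thread; one possible way, purely as an illustration, is to hash the job's parameters. The function name job_ident and the example parameter names are hypothetical.

```python
import hashlib
import json


def job_ident(params):
    """Return a short, deterministic identifier for one parameter set.

    Hypothetical helper: `params` is any dict describing the job,
    for example its rule name and wildcard values.
    """
    # sort_keys makes the result independent of the dict's key order;
    # default=str lets non-JSON values (e.g. paths) be serialized too
    payload = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]


# Example with made-up parameter names
ident = job_ident({"rule": "fit_model", "n_obs": 100, "seed": 7})
```

Including the rule name in the dict keeps the simulation and fitting steps of the same parameter set in separate directories. A random value such as $(uuidgen) avoids the need for any of this, at the cost of directory names that cannot be traced back to a job.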