Hello, PyMC community!
I am encountering problems when running a PyMC model as part of a Snakemake pipeline, particularly under high parallelism in a UGE (Univa Grid Engine) environment. The problem manifests as missing output files caused by PyTensor errors (ModuleNotFoundError and AssertionError).
The pipeline consists of two steps:
- generating the data with certain parameters (here I use PyMC to simulate data, draws = 1, cores = 1);
- checking the model with the simulated data (here I use PyMC to fit the model to the data, default parameters for pm.sample, cores = 1).
The problem arises mainly during the second step and only rarely during the first.
Environment:
- PyMC version: 5.10.3
- Python version: 3.10.11
- Operating system: Rocky Linux
- Environment: Snakemake with a UGE profile, running up to 500 parallel jobs in a mamba environment
Details:
- Intermittent missing output. Although the Snakemake logs report that the jobs completed successfully, the output files are missing. The errors do not occur consistently for particular parameter sets, so the parameters themselves are unlikely to be the root cause.
- UGE profile. The pipeline is executed with a UGE profile that allows up to 500 parallel jobs. Running the pipeline with a single job (--jobs=1) or executing it locally does not reproduce the errors.
- Errors in job logs. The log file of a failed job shows that the job did not actually finish because of a PyTensor error.
- PyTensor errors. There are two main types of errors in the jobs' log files:
  - ModuleNotFoundError (a temporary module created by PyTensor is missing);
  - AssertionError (a problem with key verification in the PyTensor compilation directory).
What I tried
- Increased the space allocated to the environment to rule out space restrictions. This did not resolve the problem, so a lack of space is probably not the cause.
- Set cores=1 for the pm.sample step. This did not help either.
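One possible further diagnostic, sketched under the assumption that the parallel jobs share a single PyTensor compilation cache and interfere with each other there: point each job at its own compile directory through the PYTENSOR_FLAGS environment variable (base_compiledir option) before PyMC is imported. The helper name isolated_compiledir_flags, the directory naming, and the use of the JOB_ID variable are illustrative assumptions, and whether this removes the errors has not been verified.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def isolated_compiledir_flags(job_id: str, root: Optional[str] = None) -> str:
    """Create a compile directory unique to job_id and return a PYTENSOR_FLAGS value for it.

    Hypothetical helper: the name and the directory layout are illustrative only.
    """
    base = Path(root or tempfile.gettempdir()) / f"pytensor_{job_id}"
    base.mkdir(parents=True, exist_ok=True)
    return f"base_compiledir={base}"


# Assumption: the scheduler exports JOB_ID for each job; fall back to the
# process ID so the script also works when run locally.
job_id = os.environ.get("JOB_ID", str(os.getpid()))

# This must run before the first `import pymc` or `import pytensor`,
# because PyTensor reads its flags at import time.
# Note: this overwrites any PYTENSOR_FLAGS value that is already set.
os.environ["PYTENSOR_FLAGS"] = isolated_compiledir_flags(job_id)
```

With a separate directory per job nothing is reused between jobs, so every job recompiles its model; the per-job directories also need to be cleaned up afterwards.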
Has anyone encountered similar issues with PyMC or PyTensor within a Snakemake pipeline, and how were they resolved?
Thank you!
Error examples
ModuleNotFound.txt (12.2 KB)
Assertion_error.txt (18.8 KB)