pm.sample_posterior_predictive hanging on certain architectures

Hey, novice pymc user here.

I’m attempting to set up pytest tests that call pm.sample_posterior_predictive as part of my project’s CI/CD. These tests have to run on a self-hosted runner for security reasons. They pass without a problem when I run them locally, but when I run them in GitHub Actions, the process hangs when sampling starts (the logs show Sampling: [] indefinitely). I’ve pinpointed the stall to the following statement in my code.

posterior_predictive_oos = pm.sample_posterior_predictive(
    trace=self.idata,
    var_names=_var_names,
    predictions=True,
    random_seed=42,
)

I’ve verified that self.idata is populated correctly at call time, identically to my local runs; the same goes for _var_names. I’ve also verified that all package versions match my local environment, where the test passes. I tried adding a keep-alive logging process, and it fails too (it stops printing logs at the expected cadence), which makes me think the whole process has died (e.g. some memory issue crashed it). Memory doesn’t seem to be the constraint, though: right before the statement above executes, the runner has more memory available than my local machine does when the test passes (psutil.virtual_memory().available reports >100 GB).
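For reference, the keep-alive was just a background thread along these lines (a simplified sketch; the interval and message format here are illustrative, not my exact code):

import threading
import time

import psutil

def _heartbeat(interval_s=30):
    # Print a timestamped line plus available memory at a fixed cadence;
    # if these lines stop appearing, the whole process has stalled.
    while True:
        avail_gb = psutil.virtual_memory().available / 1e9
        print(f"[keep-alive] {time.strftime('%H:%M:%S')} avail={avail_gb:.1f} GB", flush=True)
        time.sleep(interval_s)

threading.Thread(target=_heartbeat, daemon=True).start()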

I think this issue might be related to this discussion, but nothing in it resolved my problem.

OS: Linux 5.15.0-1071-azure
pymc: 5.17.0
pytensor: 2.25.5

Is this a known issue on any particular architectures? Is there anything else I can do to help diagnose it?

Thank you for your help

An update:

I wondered whether this might be a memory issue, so I tried subsampling the posterior before the call to pm.sample_posterior_predictive().

If I run this one specific statement before pm.sample_posterior_predictive(), sampling is successful and the test passes:
self.idata.posterior = (
    self.idata.posterior
    .stack(sample=("chain", "draw"))
    .isel(sample=slice(0, 500))
    .unstack("sample")
)

But if I change that statement to slice 501 samples, execution freezes; slicing the first 1000 freezes as well. Even stranger, the freeze happens at least three layers up the call stack: execution never even gets within three frames of the function where this statement is called. The only difference between a run that passes and one that freezes several steps before this statement even executes is changing that number from 500 to 501.

This doesn’t necessarily get me closer to a solution, but it’s some very strange behavior that might help pinpoint the cause.

Is there anything unusual about your model? Any custom Op, while-Scan, or IfElse nodes?

You can try running in Python-only mode for debugging by setting compile_kwargs=dict(mode="FAST_COMPILE") when calling sample_posterior_predictive.
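For example, reusing the call from the original post (a sketch; compile_kwargs is the only change):

posterior_predictive_oos = pm.sample_posterior_predictive(
    trace=self.idata,
    var_names=_var_names,
    predictions=True,
    random_seed=42,
    compile_kwargs=dict(mode="FAST_COMPILE"),  # skip most graph optimizations for debugging
)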

Thank you, @ricardoV94!

I checked with our modeling team, and no, we don’t have anything like that in our models. I tried adding compile_kwargs=dict(mode="FAST_COMPILE") to the sample_posterior_predictive call, and unfortunately it didn’t seem to change anything; I didn’t see any additional logs or any other difference.