pm.sample_posterior_predictive hanging on certain architectures

Hey, novice pymc user here.

I’m attempting to set up pytest tests that call pm.sample_posterior_predictive as part of my project’s CI/CD. These tests have to run on a self-hosted runner for security reasons. They pass without a problem when I run them locally, but when I run them in GitHub Actions, the process hangs when sampling starts (the logs show Sampling: [] indefinitely). I’ve pinpointed the stall to the following statement in my code.

posterior_predictive_oos = pm.sample_posterior_predictive(
    trace=self.idata,
    var_names=_var_names,
    predictions=True,
    random_seed=42,
)

I’ve verified that self.idata is populated correctly at call time, identically to my local runs; the same goes for _var_names. I’ve also verified that all package versions match my local environment, where the test passes. I tried adding a keep-alive logging process, and it fails too (it stops printing logs at the expected cadence), which makes me think the whole process has died (e.g. some memory issue crashed it). Memory doesn’t seem to be the constraint, though: right before the statement above executes, the runner has more memory available than my local machine does when the test passes (psutil.virtual_memory().available reports >100 GB).
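For reference, the keep-alive was just a background thread along these lines (a simplified sketch; the interval and message format here are illustrative, not my exact code):

import threading
import time

import psutil

def _heartbeat(interval_s=30):
    # Print a timestamped line plus available memory at a fixed cadence;
    # if these lines stop appearing, the whole process has stalled.
    while True:
        avail_gb = psutil.virtual_memory().available / 1e9
        print(f"[keep-alive] {time.strftime('%H:%M:%S')} avail={avail_gb:.1f} GB", flush=True)
        time.sleep(interval_s)

threading.Thread(target=_heartbeat, daemon=True).start()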

I think this issue might be related to this discussion, but nothing in it resolved my problem.

OS: Linux 5.15.0-1071-azure
pymc: 5.17.0
pytensor: 2.25.5

Is this a known issue on any particular architectures? Is there anything else I can do to help diagnose it?

Thank you for your help

An update:

I wondered whether this might be a memory issue, so I tried subsampling the posterior before the call to pm.sample_posterior_predictive().

If I run this one specific statement before pm.sample_posterior_predictive(), sampling is successful and the test passes:
self.idata.posterior = (
    self.idata.posterior
    .stack(sample=("chain", "draw"))
    .isel(sample=slice(0, 500))
    .unstack("sample")
)

But if I change that statement to slice 501 samples, execution freezes; slicing the first 1000 freezes as well. Even stranger, the freeze happens at least three layers up the call stack: execution never even gets within three frames of the function where this statement is called. The only difference between a run that passes and one that freezes several steps before this statement even executes is changing that number from 500 to 501.

This doesn’t necessarily get me closer to a solution, but it’s some very strange behavior that might help pinpoint the cause.

Is there anything unusual about your model? Any custom Op, while-Scan, or IfElse nodes?

You can try running in Python-only mode for debugging by setting compile_kwargs=dict(mode="FAST_COMPILE") when calling sample_posterior_predictive.
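For example, reusing the call from the original post (a sketch; compile_kwargs is the only change):

posterior_predictive_oos = pm.sample_posterior_predictive(
    trace=self.idata,
    var_names=_var_names,
    predictions=True,
    random_seed=42,
    compile_kwargs=dict(mode="FAST_COMPILE"),  # skip most graph optimizations for debugging
)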

Thank you, @ricardoV94!

I checked with our modeling team, and no, we don’t have anything like that in our models. I tried adding compile_kwargs=dict(mode="FAST_COMPILE") to the sample_posterior_predictive call, and unfortunately it didn’t seem to change anything; I didn’t see any additional logs or any other difference.