This is an issue I’ve experienced on multiple machines now (both on Windows and on Linux). I run sample_smc (4 cores, 4 chains, 10,000 draws per chain) and it goes along merrily through the stages, until it reaches the final stage with beta=1.000. And then it just stalls. It stays at that stage while the timer on the progress bar keeps counting up, and it keeps doing that until I manually interrupt it.
I know that it should have finished, because if I run the exact same code again (same input data and other settings, though not necessarily the same random seed), it completes within 20 minutes or so, whereas when it stalls it can sit there for hours until I manually interrupt it (I just had to do this on a run that had been stuck in this stalled state for 9+ hours; when I then started it again from scratch, it completed in 12 minutes).
Any ideas what could be causing this or how to fix/debug it? I have a feeling it might be something related to multiprocessing getting somehow deadlocked, though I’m not sure exactly how.
Sounds like your model is not doing great with some random seeds, so it’s unstable?
You can run a single chain at a time and should find the same problem after a couple of tries. That will rule out multiprocessing as the culprit (which I doubt it is).
Hmm, but then why would it only ever stall at the very end? I know the SMC sampler keeps going until the autocorrelation with the samples from the previous stage drops below a given threshold. I can see how this could induce a stall if something is preventing that autocorrelation from ever getting low enough. However, this criterion is applied at every stage of the algorithm (i.e., for every increment of beta), and so then wouldn’t we expect to see stalls happening at intermediate stages too? (After all, there is nothing that special about beta=1; even if the instability only arose when the likelihood became more dominant, I wouldn’t expect to see it exclusively at the final stage.)
Also, I might add that this has happened on 2 out of 3 machines that I’ve run this code on, while the 3rd one has never had this problem, which points away from this being random.
Yes! I was just now able to do this (took a few tries). I ran the code on one of my datasets with random seed 42 (that is: pymc.sample_smc(…, random_seed=42)). The first time, it stalled: it reached the final stage after about 10 minutes, but then did not progress for the next 5 minutes. I then started it again from scratch, and again it reached the final stage after about 10 minutes, but this time it finished successfully shortly after.
As an additional data point: under normal circumstances, whenever the sampling finishes, the following happens: an empty line gets printed under the “progress bars” that had been updating in-place up to that point (for some reason erasing the final line, corresponding to chain 3), and a new (identical) set of progress bars gets printed underneath. These new progress bars only remain for a few seconds while the sampling process appears to be finishing up (collecting data from the 4 different threads, perhaps? or some other “admin”?).
When a stall happens, I do get a new set of progress bars, and the last line from the previous set is also erased, but, strangely, the empty line does not get printed (this seems to be consistently the case across at least 5 instances).
Can you try to code something else from scratch / non-IP that reproduces the problem? Without a way to reproduce it, it will be hard to figure out what’s going on.
Do you have any hunch or leads I could explore in order to try to reproduce the problem outside this specific use case, or narrow down the cause a little? It’s difficult, otherwise, to know what elements from the “problematic” code to try and carry over to a minimal (non-sensitive) example.
Not really. But you have a model that’s problematic, so maybe you can pare it down until it’s no longer sensitive and still reproduces the problem. Or during that process you may be able to spot the problem.
I’ve experienced a similar behavior and I remember examining the influence of having -inf likelihoods or float64 vs. float32 precision, but I don’t remember the details… Have you tried playing with these?
Hmm interesting. I’m using (default) 64-bit precision which I assume should be more stable (I tried switching to 32-bit to speed things up, but never got that to work). I don’t think I’ve seen infinities but maybe I haven’t looked in the right place. How would you recommend diagnosing this? Just logging the logp’s and checking for infs?
For float32 (which does speed things up for sure) I guess the easiest is to do something like:
export THEANO_FLAGS='floatX=float32,base_compiledir=/tmp/theano.NOBACKUP'; python script.py
(On a side note, I noticed there could be issues if dtype is set to different precisions in the various RVs.)
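(For what it’s worth, on recent PyMC versions the backend is PyTensor rather than Theano, so I assume the equivalent would be the PYTENSOR_FLAGS environment variable, or setting the precision in the script before building the model; a minimal sketch:)

import pytensor

# Assumed PyTensor equivalent of the THEANO_FLAGS approach above:
# switch the default float precision before any model is defined.
pytensor.config.floatX = "float32"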
For the infinity it depends how the likelihood is calculated. I’m using a custom function and put everything in a pm.Potential, so it’s easy enough to identify -inf values. I’ll keep an eye on this behavior at the end of the SMC sampling, but it’s been some time since I last saw it, so I must have fixed it one way or another (at least I hope so…!).
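If it helps, a quick first check is to evaluate every logp term (Potentials included) at the model’s initial point and look for -inf or nan entries. A minimal sketch, assuming the model from the problematic run is in scope as model:

print(model.point_logps())              # per-term logp at the initial point

logp_fn = model.compile_logp()          # full model logp as a callable
print(logp_fn(model.initial_point()))

Of course this only catches infinities at the starting point; a term that only goes to -inf for some parameter values would still need logging inside the likelihood itself.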
Update: it still happens once in a while with my code, with float32 precision and with the same random seed.
Update #2: for what it’s worth, forcing a single worker does seem to solve the problem. Strangely, some random seeds do seem to cause the infinite loop more than others (but it’s still not 100% reproducible).
Thanks for taking the time to look into the issue within your setup! Interesting that forcing a single worker seems to solve it for you (hadn’t tried that yet), which matches my hunch that it is something to do with multiprocessing.
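(If I understand correctly, forcing a single worker here would just mean passing cores=1 to the sampler, i.e. something along the lines of:)

import pymc as pm

with model:  # assuming the same model and settings as before
    idata = pm.sample_smc(draws=10_000, chains=4, cores=1, random_seed=42)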
Another data point that points in this direction: if I look at the CPU usage, it drops to almost 0% at the point where the execution stalls. So it seems clear that the actual sampling has stopped, and something is waiting for something else that never happens.
When I interrupt the execution after it has reached a stall, it says it’s on line 390 of run_chains, which is where the ProcessPoolExecutor “with” context is opened. That makes me wonder whether the problem is in shutting down the executor, so that the context fails to close and we never exit this block (@ricardoV94)? I’ve (just now) added some print statements inside that function to try to see what happens during a stall, but since I cannot reproduce it reliably, I’ll have to wait until one occurs.
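In the meantime, one way to see where a stalled process is blocked without killing it might be to register a traceback-on-signal handler near the top of the script (a minimal sketch; faulthandler.register is Unix-only, so this would only help on the Linux machine):

import faulthandler
import signal

# After this, `kill -USR1 <pid>` on the stalled process prints the current
# Python traceback of every thread in that process without terminating it.
faulthandler.register(signal.SIGUSR1, all_threads=True)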
Just got another stall, which gave me some more information about where the problem is occurring. Below is a section from run_chains in sampling.py into which I inserted some print statements, to see how far the execution has gotten when a stall happens. In the stall that just happened, the “All chains finished” statement did get printed, so I know that all the multiprocessing tasks had completed. And yet, the progress bars remain on screen and the timers are still counting up. So it really seems to me like the thing that needs to happen, and is failing to happen, is for the executor to be shut down so that the “with ProcessPoolExecutor” context can be closed and we can exit the run_chains function.
with ProcessPoolExecutor(max_workers=cores) as executor:
    for c in range(chains):  # iterate over the jobs we need to run
        # set visible false so we don't have a lot of bars all at once:
        task_id = progress.add_task(f"Chain {c}", status="Stage: 0 Beta: 0")
        futures.append(
            executor.submit(
                _sample_smc_int,
                *params,
                random_seed[c],
                c,
                _progress,
                task_id,
                **kernel_kwargs,
            )
        )

    # monitor the progress:
    done = []
    remaining = futures
    num_tasks = len(futures)
    while len(remaining) > 0 and len(done) < num_tasks:
        finished, remaining = wait(remaining, timeout=0.1)
        if len(finished) > 0:
            done.extend(finished)
            print('{} chains finished'.format(len(done)))
        for task_id, update_data in _progress.items():
            stage = update_data["stage"]
            beta = update_data["beta"]
            # update the progress bar for this task:
            progress.update(
                status=f"Stage: {stage} Beta: {beta:.3f}",
                task_id=task_id,
                refresh=True,
            )
    print('All chains finished')
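To illustrate the kind of failure mode I have in mind (purely a toy sketch of ProcessPoolExecutor behaviour, not PyMC’s code): all futures can be done while the “with” block still refuses to close, because exiting it calls executor.shutdown(wait=True), which joins the worker processes. In the sketch below, a leftover non-daemon thread inside the worker stands in for whatever might keep the real worker processes alive; the script prints “All futures finished” and then sits at the end of the with block until the workers can actually exit.

import multiprocessing as mp
import threading
import time
from concurrent.futures import ProcessPoolExecutor, wait


def task(n):
    # Stand-in for a job whose worker process cannot shut down cleanly:
    # the future completes normally, but the leftover non-daemon thread
    # keeps the worker process alive (30 s here, instead of forever).
    threading.Thread(target=time.sleep, args=(30,)).start()
    return n * n


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn, so behaviour matches on Linux and Windows
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        futures = [executor.submit(task, i) for i in range(4)]
        done, _ = wait(futures)
        print("All futures finished:", sorted(f.result() for f in done))
        # Exiting the block calls executor.shutdown(wait=True), which joins
        # the worker processes; it blocks here until the stray threads exit.
    print("Executor shut down")

Whether something like this (or a worker dying without draining its queues) is what actually happens in the stalled runs, I can’t tell yet; it just shows that “all chains finished” and “the with block closes” are two separate events.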
On my side I’m somewhat confused: I’ve launched many runs yesterday and the stalling never came up again… I hope that @rubvber’s tests can be conclusive despite the erratic behavior.
Just completed a full run through my entire dataset (about 70 instances) with version 5.12.0, with zero stalls. Now, the stalls always appeared random before, so that’s not a 100% guarantee that the issue isn’t present in this version, but I feel pretty confident, as I’ve never been able to get even half that far without hitting at least one stall.
I hope this further helps to diagnose & fix the problem. In the meantime, would there be any downside to me continuing to use 5.12.0 (other than not having the fancier progress bars)?