This is an issue I’ve experienced on multiple machines now (both on Windows and on Linux). I run sample_smc (4 cores, 4 chains, 10,000 draws per chain) and it goes along merrily through the stages, until it reaches the final stage with beta=1.000. And then it just stalls. It stays at that stage while the timer on the progress bar keeps counting up, and it keeps doing that until I manually interrupt it.
I know that it should have finished, because if I run the exact same code again (same input data and other settings, though not necessarily the same random seed), it completes within 20 minutes or so, whereas when it stalls it can sit there for hours until I manually interrupt it (I just had to do this on a run that had been stuck in this stalled state for 9+ hours; when I then started it again from scratch, it completed in 12 minutes).
Any ideas what could be causing this or how to fix/debug it? I have a feeling it might be something related to multiprocessing getting somehow deadlocked, though I’m not sure exactly how.
Sounds like your model is not doing great with some random seeds, so it’s unstable?
You can run a single chain at a time and should find the same problem after a couple of tries. That will rule out multiprocessing as the culprit (which I doubt it is).
Hmm, but then why would it only ever stall at the very end? I know the SMC sampler keeps going until the autocorrelation with the samples from the previous stage drops below a given threshold. I can see how this could induce a stall if something is preventing that autocorrelation from ever getting low enough. However, this criterion is applied at every stage of the algorithm (i.e., for every increment of beta), and so then wouldn’t we expect to see stalls happening at intermediate stages too? (After all, there is nothing that special about beta=1; even if the instability only arose when the likelihood became more dominant, I wouldn’t expect to see it exclusively at the final stage.)
Also, I might add that this has happened on 2 out of 3 machines that I’ve run this code on, while the 3rd one has never had this problem, which points away from this being random.
Yes! I was just now able to do this (took a few tries). I ran the code on one of my datasets with random seed 42 (that is: pymc.sample_smc(…, random_seed=42)). The first time, it stalled: it reached the final stage after about 10 minutes, but then did not progress for the next 5 minutes. I then started it again from scratch, and again it reached the final stage after about 10 minutes, but this time it finished successfully shortly after.
As an additional data point: under normal circumstances, whenever the sampling finishes, the following happens: an empty line gets printed under the “progress bars” that had been updating in-place up to that point (for some reason erasing the final line, corresponding to chain 3), and a new (identical) set of progress bars gets printed underneath. These new progress bars only remain for a few seconds while the sampling process appears to be finishing up (collecting data from the 4 different threads, perhaps? or some other “admin”?).
When a stall happens, I do get a new set of progress bars, and the last line from the previous set is also erased, but, strangely, the empty line does not get printed (this seems to be consistently the case across at least 5 instances).
Can you try to code something else from scratch / non-IP that reproduces the problem? Without a way to reproduce it, it will be hard to figure out what’s going on.
Do you have any hunch or leads I could explore in order to try to reproduce the problem outside this specific use case, or narrow down the cause a little? It’s difficult, otherwise, to know what elements from the “problematic” code to try and carry over to a minimal (non-sensitive) example.
Not really. But you have a model that’s problematic, so maybe you can pare it down until it’s no longer sensitive and still reproduces the problem. Or during that process you may be able to spot the problem.
I’ve experienced a similar behavior and I remember examining the influence of having -inf likelihoods or float64 vs. float32 precision, but I don’t remember the details… Have you tried playing with these?
Hmm interesting. I’m using (default) 64-bit precision which I assume should be more stable (I tried switching to 32-bit to speed things up, but never got that to work). I don’t think I’ve seen infinities but maybe I haven’t looked in the right place. How would you recommend diagnosing this? Just logging the logp’s and checking for infs?
For float32 (which does speed things up for sure) I guess the easiest is to do something like:
export THEANO_FLAGS='floatX=float32,base_compiledir=/tmp/theano.NOBACKUP'; python script.py
(On a side note, I noticed there could be issues if dtype is set to different precisions in the various RVs.)
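(For what it’s worth, on recent PyMC versions the backend is PyTensor rather than Theano, so I assume the equivalent would be the PYTENSOR_FLAGS environment variable, or setting the precision in the script before building the model; a minimal sketch:)

import pytensor

# Assumed PyTensor equivalent of the THEANO_FLAGS approach above:
# switch the default float precision before any model is defined.
pytensor.config.floatX = "float32"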
For the infinity it depends how the likelihood is calculated. I’m using a custom function and put everything in a pm.Potential, so it’s easy enough to identify -inf values. I’ll keep an eye on this behavior at the end of the SMC sampling, but it’s been some time since I last saw it, so I must have fixed it one way or another (at least I hope so…!).
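If it helps, a quick first check is to evaluate every logp term (Potentials included) at the model’s initial point and look for -inf or nan entries. A minimal sketch, assuming the model from the problematic run is in scope as model:

print(model.point_logps())              # per-term logp at the initial point

logp_fn = model.compile_logp()          # full model logp as a callable
print(logp_fn(model.initial_point()))

Of course this only catches infinities at the starting point; a term that only goes to -inf for some parameter values would still need logging inside the likelihood itself.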
Update: it still happens once in a while with my code, with float32 precision and with the same random seed.
Update #2: for what it’s worth, forcing a single worker does seem to solve the problem. Strangely, some random seeds do seem to cause the infinite loop more than others (but it’s still not 100% reproducible).
Thanks for taking the time to look into the issue within your setup! Interesting that forcing a single worker seems to solve it for you (hadn’t tried that yet), which matches my hunch that it is something to do with multiprocessing.
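(If I understand correctly, forcing a single worker here would just mean passing cores=1 to the sampler, i.e. something along the lines of:)

import pymc as pm

with model:  # assuming the same model and settings as before
    idata = pm.sample_smc(draws=10_000, chains=4, cores=1, random_seed=42)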
Another data point that points in this direction: if I look at the CPU usage, it drops to almost 0% at the point where the execution stalls. So it seems clear that the actual sampling has stopped, and something is waiting for something else that never happens.
When I interrupt the execution after it has reached a stall, it says it’s on line 390 of run_chains, which is where the ProcessPoolExecutor “with” context is opened. That makes me wonder whether the problem is in shutting down the executor, so that the context fails to close and we never exit this block (@ricardoV94)? I’ve (just now) added some print statements inside that function to try to see what happens during a stall, but since I cannot reproduce it reliably, I’ll have to wait until one occurs.
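In the meantime, one way to see where a stalled process is blocked without killing it might be to register a traceback-on-signal handler near the top of the script (a minimal sketch; faulthandler.register is Unix-only, so this would only help on the Linux machine):

import faulthandler
import signal

# After this, `kill -USR1 <pid>` on the stalled process prints the current
# Python traceback of every thread in that process without terminating it.
faulthandler.register(signal.SIGUSR1, all_threads=True)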
Just got another stall, which gave me some more information about where the problem is occurring. Below is a section from run_chains in sampling.py into which I inserted some print statements, to see how far the execution has gotten when a stall happens. In the stall that just happened, the “All chains finished” statement did get printed, so I know that all the multiprocessing tasks had completed. And yet, the progress bars remain on screen and the timers are still counting up. So it really seems to me like the thing that needs to happen, and is failing to happen, is for the executor to be shut down so that the “with ProcessPoolExecutor” context can be closed and we can exit the run_chains function.
with ProcessPoolExecutor(max_workers=cores) as executor:
    for c in range(chains):  # iterate over the jobs we need to run
        # set visible false so we don't have a lot of bars all at once:
        task_id = progress.add_task(f"Chain {c}", status="Stage: 0 Beta: 0")
        futures.append(
            executor.submit(
                _sample_smc_int,
                *params,
                random_seed[c],
                c,
                _progress,
                task_id,
                **kernel_kwargs,
            )
        )

    # monitor the progress:
    done = []
    remaining = futures
    num_tasks = len(futures)
    while len(remaining) > 0 and len(done) < num_tasks:
        finished, remaining = wait(remaining, timeout=0.1)
        if len(finished) > 0:
            done.extend(finished)
            print('{} chains finished'.format(len(done)))
        for task_id, update_data in _progress.items():
            stage = update_data["stage"]
            beta = update_data["beta"]
            # update the progress bar for this task:
            progress.update(
                status=f"Stage: {stage} Beta: {beta:.3f}",
                task_id=task_id,
                refresh=True,
            )
    print('All chains finished')
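To illustrate the kind of failure mode I have in mind (purely a toy sketch of ProcessPoolExecutor behaviour, not PyMC’s code): all futures can be done while the “with” block still refuses to close, because exiting it calls executor.shutdown(wait=True), which joins the worker processes. In the sketch below, a leftover non-daemon thread inside the worker stands in for whatever might keep the real worker processes alive; the script prints “All futures finished” and then sits at the end of the with block until the workers can actually exit.

import multiprocessing as mp
import threading
import time
from concurrent.futures import ProcessPoolExecutor, wait


def task(n):
    # Stand-in for a job whose worker process cannot shut down cleanly:
    # the future completes normally, but the leftover non-daemon thread
    # keeps the worker process alive (30 s here, instead of forever).
    threading.Thread(target=time.sleep, args=(30,)).start()
    return n * n


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn, so behaviour matches on Linux and Windows
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        futures = [executor.submit(task, i) for i in range(4)]
        done, _ = wait(futures)
        print("All futures finished:", sorted(f.result() for f in done))
        # Exiting the block calls executor.shutdown(wait=True), which joins
        # the worker processes; it blocks here until the stray threads exit.
    print("Executor shut down")

Whether something like this (or a worker dying without draining its queues) is what actually happens in the stalled runs, I can’t tell yet; it just shows that “all chains finished” and “the with block closes” are two separate events.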
On my side I’m somewhat confused: I’ve launched many runs yesterday and the stalling never came up again… I hope that @rubvber’s tests can be conclusive despite the erratic behavior.
Just completed a full run through my entire dataset (about 70 instances) with version 5.12.0, with zero stalls. Now, the stalls always appeared random before, so that’s not a 100% guarantee that the issue isn’t present in this version, but I feel pretty confident, as I’ve never been able to get even half that far without hitting at least one stall.
I hope this further helps to diagnose & fix the problem. In the meantime, would there be any downside to me continuing to use 5.12.0 (other than not having the fancier progress bars)?