Sampling doesn't start when njobs > 1 for some models

aplavin · February 12, 2018, 6:15am

Recently I noticed that I can’t sample certain models with several parallel chains, while others work fine. A minimal example that fails:

import numpy as np
import pymc3 as pm

n = 96
X = np.random.rand(n, 1)
observed = np.random.rand(n)

with pm.Model() as model:
    cov = pm.gp.cov.ExpQuad(1, 1)
    gp = pm.gp.Latent(mean_func=pm.gp.mean.Zero(), cov_func=cov)
    gp_f = gp.prior('gp_f', X=X)
    val_obs = pm.Normal('val_obs', mu=gp_f, sd=0.1, observed=observed)

    trace = pm.sample(njobs=2)

Whether it is run as a standalone Python script or from a notebook, the output is the following:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [gp_f_rotated_]
  0%|                                                                              | 0/1000 [00:00<?, ?it/s]

It then stays in this state indefinitely, without loading the CPU at all. This is the traceback after interrupting the process following some waiting:

Traceback (most recent call last):
  File "tmp.py", line 14, in <module>
    trace = pm.sample(njobs=2)
  File "*/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py", line 437, in sample
    trace = _mp_sample(**sample_args)
  File "*/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py", line 967, in _mp_sample
    traces = Parallel(n_jobs=cores)(jobs)
  File "*/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 789, in __call__
    self.retrieve()
  File "*/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 699, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "*/anaconda3/lib/python3.6/multiprocessing/pool.py", line 638, in get
    self.wait(timeout)
  File "*/anaconda3/lib/python3.6/multiprocessing/pool.py", line 635, in wait
    self._event.wait(timeout)
  File "*/anaconda3/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "*/anaconda3/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
KeyboardInterrupt
^C

This is really a minimal example: if I reduce n from 96 to 95, sampling succeeds! Various other models also work without problems, and the example above works correctly if I set njobs=1.

I can reproduce it on two Linux PCs, where everything Python-related is installed in the same way using Anaconda in a Docker container.
Any ideas how to debug the issue?

junpenglao · February 12, 2018, 6:25am

This is likely a memory problem. I see it all the time on my Mac… Setting njobs=1 helps because either the model runs, or if it fails, it won’t fail silently.

aplavin · February 12, 2018, 6:28am

By memory problem you mean low memory, right? That’s surprising because the machines have 64 GB of RAM, and in any case much larger models, including GPs, can be sampled (using njobs=1).

junpenglao · February 12, 2018, 6:32am

Hmm, in that case maybe it is a joblib problem. I remember seeing one in the issues: https://github.com/pymc-devs/pymc3/issues/2640. I am not sure if it is related, though.

aplavin · February 12, 2018, 6:37am

It’s very difficult to guess what the problem is, as it doesn’t raise any exception; it just stalls indefinitely. Is there something like a “verbose” mode in pymc, or similar?

junpenglao · February 12, 2018, 6:42am

There is a theano verbose mode, but in this case the problem is likely joblib related. Maybe you can try turning on verbose output in joblib: https://pythonhosted.org/joblib/parallel.html

aplavin · February 12, 2018, 6:53am

Well, the joblib verbose mode doesn’t seem to help at all: it prints many messages of the form “Pickling array (shape=(*,), dtype=float32).”, and there are still no errors while sampling stalls.
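
When a call stalls without raising anything, one standard-library option is faulthandler, which can print the stack of every thread after a timeout without killing the process. The sketch below is not from this thread: run_with_stall_dump and slow_job are hypothetical names, and slow_job merely stands in for a call such as pm.sample(...). It reports only the process in which it is armed, so in a multiprocess run it would show where the parent is waiting, not what the workers are doing.

```python
import faulthandler
import sys
import time


def run_with_stall_dump(func, timeout_s, log_file=None):
    """Hypothetical helper: call func() and, if it is still running after
    timeout_s seconds, write the stack of every thread to log_file.
    The process is not killed; the dump is purely diagnostic."""
    if log_file is None:
        log_file = sys.stderr  # must be a real file with a file descriptor
    faulthandler.dump_traceback_later(timeout_s, file=log_file, exit=False)
    try:
        return func()
    finally:
        # Disarm the watchdog once func() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def slow_job():
    # Stand-in for a call that appears to hang, e.g. pm.sample(...).
    time.sleep(0.5)
    return "done"


if __name__ == "__main__":
    # The dump goes to stderr after 0.1 s, then the result is printed.
    print(run_with_stall_dump(slow_job, timeout_s=0.1))
```

In a notebook, sys.stderr may not be backed by a real file descriptor, so passing an opened log file as log_file is probably the safer choice there.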

madarshahian · May 11, 2018, 8:36pm

I am curious whether there is any update on this post. I have a similar problem.

aplavin · May 11, 2018, 8:51pm

Yeah, I also still experience this issue, and not only for the model mentioned in the first post. Entirely different classes of models don’t sample in multiple processes, so it’s not specific to Gaussian processes as I thought at first.

junpenglao · May 11, 2018, 9:58pm

I suspect it has something to do with joblib pickling. Unfortunately, I cannot pin down where it goes wrong yet.

magoarcano · November 3, 2018, 4:16pm

It worked for me using:
trace_g = pm.sample(1100, cores=1)

no0ne · September 6, 2019, 2:44am

Was this issue ever resolved? I seem to be experiencing the same problem, but only on Linux. I have two identical Anaconda environments: one on my desktop running macOS, and the other on a Linux box in a rack running Ubuntu. If I sample with chains=2 and cores=1, everything runs fine on both machines. If I sample with chains=2 and cores=2, it works fine on the Mac, while on Linux the sampling sits at 0% indefinitely until I kill it. I’m not really sure how to begin diagnosing the issue, as there is no error and the Python environments are identical as far as I can tell.

fbartolic · September 6, 2019, 9:59am

I’m experiencing the same issue with certain models on macOS as well…

aseyboldt · September 6, 2019, 1:36pm

The multicore sampling has been completely rewritten by now, so if there is still an issue with frozen chains, it must have a different cause. Are you using the most recent pymc3 release? Does this happen sporadically or regularly with one model (if so, which)?

kix2 · January 17, 2020, 9:28am

I also have this issue. I have the latest version; I just installed the module last week. For me, it used to work on multiple cores and now it only works on one: same model, same code. I’m running Python 3.7 on macOS.

aseyboldt · January 17, 2020, 9:40am

Can you share the model that shows this behaviour? Also, what exactly is happening?

kix2 · January 17, 2020, 9:54am

This is the model. The issue is: until 2 days ago I could sample 4 chains on 8 cores with no issue. Now, if I run it with more than one core, the sampler just doesn’t start.

aseyboldt · January 17, 2020, 9:57am

Can you write a minimal model that I can run myself?
What happened between now and 2 days ago? Did you update (pymc or something else)? From what version to which? Conda or pip?
What exactly does “the sampler just doesn’t start” mean? Do you still get the progress bar, but it stays at 0 draws, do you get an error message…
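
A small script along these lines could help answer the version questions above. It is only a sketch under stated assumptions: environment_report is a hypothetical helper, the package list is a guess at what is relevant here, and importlib.metadata requires Python 3.8 or newer (older interpreters would need the importlib_metadata backport instead).

```python
import multiprocessing
import platform
from importlib import metadata


def environment_report(packages=("pymc3", "theano", "joblib", "numpy")):
    """Hypothetical helper: collect interpreter, OS and package versions
    so that two environments can be compared line by line."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        # The start method is one of several things that can differ
        # between operating systems. Note that calling this fixes the
        # default start method for the current process.
        "mp_start_method": multiprocessing.get_start_method(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it before and after an update, or on both machines, and diffing the output gives a concrete answer to "from what version to which".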

kix2 · January 17, 2020, 10:16am

We can use this notebook I found online, which also doesn’t work: https://github.com/WillKoehrsen/Data-Analysis/blob/master/bayesian_lr/Bayesian%20Linear%20Regression%20Project.ipynb (cells 39-40)

By ‘sampler doesn’t start’ I mean it’s forever frozen at this screen:

So nothing really happened, actually. I was running a model with 4 chains on 8 cores and target_accept=.9, because I was following this tutorial (https://docs.pymc.io/notebooks/Diagnosing_biased_Inference_with_Divergences.html). It was 2 am, so I went to sleep after the model finished running, and the next day, without even restarting the Jupyter kernel, it didn’t work anymore. By “didn’t work” I mean the sampler wouldn’t start, as in the image above.

Anyway, I suspected I had some memory issues, or that other processes were running on my computer, so I closed everything, restarted everything, and at some point attempted reinstalling pymc via pip (just as it was installed originally).

kix2 · January 17, 2020, 10:48am

This is a very random update, but I didn’t want to waste any more of your time. I did a force reinstall of all the packages in my virtual Python environment, and now it somehow works. I’m really sorry for commenting here; I had been trying to get this to work again for the last 2 days, to no avail. I still have no idea what happened. Anyway, thank you for the super quick reply. I really appreciate the entire package as I’m trying to learn my way through probabilistic modelling.
