Sampling doesn't start when njobs > 1 for some models

Recently I noticed that I can’t sample certain models with several parallel chains, while others work ok. For a minimal example of what doesn’t work:

import numpy as np
import pymc3 as pm

n = 96
X = np.random.rand(n, 1)
observed = np.random.rand(n)

with pm.Model() as model:
    cov = pm.gp.cov.ExpQuad(1, 1)
    gp = pm.gp.Latent(mean_func=pm.gp.mean.Zero(), cov_func=cov)
    gp_f = gp.prior('gp_f', X=X)
    val_obs = pm.Normal('val_obs', mu=gp_f, sd=0.1, observed=observed)

    trace = pm.sample(njobs=2)

Whether I run it as a separate Python script or from a notebook, the output is the following:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [gp_f_rotated_]
  0%|                                                                              | 0/1000 [00:00<?, ?it/s]

Then it stays in this state indefinitely, without loading the CPU at all. The traceback when I interrupt the process after some waiting:

Traceback (most recent call last):
  File "tmp.py", line 14, in <module>
    trace = pm.sample(njobs=2)
  File "*/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py", line 437, in sample
    trace = _mp_sample(**sample_args)
  File "*/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py", line 967, in _mp_sample
    traces = Parallel(n_jobs=cores)(jobs)
  File "*/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 789, in __call__
    self.retrieve()
  File "*/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 699, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "*/anaconda3/lib/python3.6/multiprocessing/pool.py", line 638, in get
    self.wait(timeout)
  File "*/anaconda3/lib/python3.6/multiprocessing/pool.py", line 635, in wait
    self._event.wait(timeout)
  File "*/anaconda3/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "*/anaconda3/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
KeyboardInterrupt
^C

This is really a minimal example: if I reduce n from 96 to 95, sampling succeeds! Various other models also work without problems, and the example above works correctly if I set njobs=1.

I can reproduce it on two Linux PCs, where everything python-related is installed in the same way using anaconda in a docker container.
Any ideas how to debug the issue?

This is likely a memory problem. I see it all the time on my mac… Setting njobs=1 helps either way: either the model runs, or if it fails, it no longer fails silently.

By memory problem you mean low memory, right? That’s surprising because the machines have 64 gigs of ram, and anyway, much larger models including GPs can be sampled (using njobs = 1).

Hmm, in that case maybe it is a joblib problem. I remember seeing one in the issues: https://github.com/pymc-devs/pymc3/issues/2640 Not sure if it is related though.

It’s very difficult to guess what’s the problem as it doesn’t raise any exception - just stalls indefinitely. Maybe there is something like “verbose” mode in pymc, or similar?

There is a theano verbose mode, but in this case the problem is likely joblib related. Maybe you can try turning on verbose output in joblib: https://pythonhosted.org/joblib/parallel.html

Well, the joblib verbose mode doesn’t seem to help at all: it prints many messages of the form Pickling array (shape=(*,), dtype=float32)., and still no errors while sampling stalls.
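
Since neither verbose mode produces an error, one thing that might help is asking Python itself to dump the stack of every thread once a call has been running suspiciously long. Below is a minimal sketch using the standard-library faulthandler module. The helper name run_with_stall_dump and the log path are made up for illustration. Note the limitation: it only reports on the process it runs in, so for the parent it would most likely show the same joblib wait as the Ctrl-C traceback above. To see what a worker is doing, the same call would have to run inside the worker.

```python
import faulthandler


def run_with_stall_dump(work, log_path, timeout_s=60.0):
    """Run work(); while it is still running after timeout_s seconds,
    write the traceback of every thread in this process to log_path.

    run_with_stall_dump is a hypothetical helper, not part of pymc3 or joblib.
    """
    with open(log_path, "w") as log:
        # repeat=True dumps again every timeout_s seconds; exit=False keeps
        # the process alive so the stall can still be inspected.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log, exit=False)
        try:
            return work()
        finally:
            # Stop the watchdog once the work returns or raises.
            faulthandler.cancel_dump_traceback_later()


# Hypothetical usage, inside the `with pm.Model()` block from the first post:
#     trace = run_with_stall_dump(lambda: pm.sample(njobs=2), "stall_dump.log")
```

If the log stays empty, the call finished before the timeout; if it fills up, the last frames listed show where that process is waiting.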

Is there any update on this post? I have a similar problem.

Yeah, I also still experience this issue, and not only for the model mentioned in the first post. Entirely different classes of models don’t sample in multiple processes, so it’s not related to Gaussian processes as I thought at first.

I suspect it has something to do with joblib pickling. Unfortunately, I cannot pin down where it goes wrong yet.
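
One rough way to test the pickling suspicion is to check which of the objects handed to the workers survive a pickle round trip in the parent process. The sketch below uses only the standard-library pickle module, and find_unpicklable is a hypothetical helper. It assumes the worker backend serializes objects in a way comparable to plain pickle, which may not hold for every joblib backend. Since the thread reports a silent stall rather than a pickling error, a clean result here only rules out one possible cause.

```python
import pickle


def find_unpicklable(named_objects):
    """Return {name: error message} for every object that fails a pickle round trip.

    find_unpicklable is a hypothetical helper for illustration only.
    """
    failures = {}
    for name, obj in named_objects.items():
        try:
            pickle.loads(pickle.dumps(obj))
        except Exception as exc:  # pickling can raise several exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


if __name__ == "__main__":
    # Stand-in objects; in practice these would be the model, step method, etc.
    candidates = {"data": [1.0, 2.0, 3.0], "callback": lambda x: x}
    for name, error in find_unpicklable(candidates).items():
        print(f"{name} cannot be pickled: {error}")
```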

It worked for me using:
trace_g = pm.sample(1100, cores=1)

Was this issue ever resolved? I seem to be experiencing the same problem, but only on linux. I have two identical anaconda environments: one on my desktop running macOS, and the other on a linux box in a rack running ubuntu. If I sample with chains=2 and cores=1, everything runs fine on both machines. If I sample with chains=2 and cores=2, then running on the mac works fine while running on linux results in the sampling sitting at 0% indefinitely until I kill it. I’m not really sure how to begin diagnosing the issue as there is no error, and the python environments are identical as far as I can tell.
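
Because the stall produces no error, it can help to run each configuration in a child interpreter with a time limit, so that "sits at 0% forever" turns into an explicit result that can be compared across machines. This is a minimal sketch using the standard-library subprocess module. run_script_with_timeout is a hypothetical helper, and the scripts passed to it here are trivial stand-ins for a real sampling script run once with cores=1 and once with cores=2. One caveat: on timeout only the direct child is killed, so worker processes it started may need to be cleaned up separately.

```python
import subprocess
import sys


def run_script_with_timeout(code, timeout_s):
    """Run Python source in a child interpreter; return 'ok', 'failed', or 'stalled'.

    run_script_with_timeout is a hypothetical helper for illustration only.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising TimeoutExpired.
        return "stalled"
    return "ok" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in scripts; replace with the sampling code for each cores setting.
    print(run_script_with_timeout("print('sampling finished')", timeout_s=30))
    print(run_script_with_timeout("import time; time.sleep(30)", timeout_s=1))
```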

I’m experiencing the same issue with certain models on macOS as well…

The multicore sampling has been completely rewritten by now, so if there still is an issue with frozen chains, it is clearly a different one. Are you using the most recent pymc3 release? Does this happen sporadically or regularly with one model (if so, which)?

I also have this issue. I have the latest version; I installed the module just last week. It used to work on multiple cores and now it only works on one, with the same model and the same code. I’m running Python 3.7 on macOS.

Can you share the model that shows this behaviour? Also, what exactly is happening?

This is the model. The issue is: until 2 days ago I could sample 4 chains on 8 cores with no issue. Now, if I run it with more than one core, the sampler just doesn’t start.

Can you write a minimal model that I can run myself?
What happened between now and 2 days ago? Did you update (pymc or something else)? From what version to which? Conda or pip?
What exactly does “the sampler just doesn’t start” mean? Do you still get the progress bar, but it stays at 0 draws, do you get an error message…
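
To make the version questions easier to answer, a short report of the interpreter, platform, and installed package versions can be pasted into the thread. This is a minimal sketch that assumes Python 3.8 or newer, since importlib.metadata is not in the standard library of the 3.6 and 3.7 interpreters mentioned above. environment_report is a hypothetical helper, and the default package names are my guess at the relevant distributions.

```python
from importlib import metadata
import platform
import sys


def environment_report(packages=("pymc3", "theano", "joblib", "numpy")):
    """Return a dict with the Python version, platform, and package versions.

    environment_report is a hypothetical helper; packages that are not
    installed are reported as None.
    """
    report = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it on a machine that works and on one that stalls, then comparing the two outputs, is a quick check of whether the environments really are identical.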

We can use this notebook I found online, which also doesn’t work: https://github.com/WillKoehrsen/Data-Analysis/blob/master/bayesian_lr/Bayesian%20Linear%20Regression%20Project.ipynb (cells 39-40)

By ‘sampler doesn’t start’ I mean it’s forever frozen at this screen:

So nothing really happened, actually. I was running a model on 4 chains, 8 cores, target_accept=.9 because I was following this tutorial here (https://docs.pymc.io/notebooks/Diagnosing_biased_Inference_with_Divergences.html ), but it was 2 am so I went to sleep after the model finished running, and then the next day (without even restarting the jupyter kernel) it didn’t work anymore. By didn’t work I mean the sampler wouldn’t start, like in the image above.

Anyway, I suspected I had some memory issues, or that other processes were running on my computer, so I closed everything, restarted everything and at some point attempted reinstalling pymc via pip (just like it was installed originally).

This is a very random update, but I didn’t want to waste any more of your time. I did a force reinstall of all the packages in my virtual python environment, and now it somehow works. I’m really sorry for commenting here; I had been trying to get this to work again for the last 2 days to no avail. I still have no idea what happened. Anyway, thank you for the super quick reply. I really appreciate the entire package as I’m trying to learn my way through probabilistic modelling.
