NUTS sampling doesn't work

Hi everybody,
I have problems using PyMC3 in Jupyter Notebook on Windows 10. Here is an example of what I’m currently trying:

import arviz as az
import numpy as np
import pymc3 as pm

RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)

print("Running on PyMC3 v{}".format(pm.__version__))


#true parameters
alpha, sigma = 1, 1
beta = [1, 2.5]
# Size of dataset
size = 100

# Predictor variable
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = alpha + beta[0] * X1 + beta[1] * X2 + np.random.randn(size) * sigma

with basic_model:
    # draw 500 posterior samples
    trace = pm.sample(500, init='advi',cores=2)


az.plot_trace(trace);

So I build a model and try to sample from its posterior to get a trace. The problem is that every time I run the code, everything works fine until Jupyter Notebook has to sample the chains. As soon as the output reaches ‘Sampling 2 chains, 0 divergences: 0%| | 0/2000 [00:00<?, ?draws/s]’, the connection to the kernel is lost, and from then on Jupyter Notebook is unable to reconnect. Attached is a picture of what happens.
I am completely clueless about what’s going wrong and why it won’t work… I hope some of you can help me with this!

Thanks in advance and best regards
Leon!

The picture: https://i.stack.imgur.com/75u3S.png

Can you share the code for your model to see what is happening?

I am working with the code above; the picture was just from the original project where the problem first occurred. The same problem happens with the code above. I try to work with code that is as simple as possible to avoid syntax errors.
I even made some progress today: I figured out that it must have something to do with multiprocessing. When I set chains and cores both to 1, the connection to the kernel doesn’t get lost… Is this a known problem, and is there a way to fix it?
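For reference, the single-process workaround described above could be wrapped up roughly like this. It is only a sketch: `sampler_options` is a hypothetical helper, not part of PyMC3, and it assumes that `cores=1` is enough to make PyMC3 run the chains one after another in the notebook's own process (keeping two chains so that convergence checks remain possible).

```python
import sys


def sampler_options(platform=sys.platform, draws=500):
    """Build keyword arguments for pm.sample (hypothetical helper)."""
    options = {"draws": draws, "chains": 2}
    if platform.startswith("win"):
        # Assumption: with cores=1 the chains are sampled sequentially,
        # so no worker processes have to be started on Windows.
        options["cores"] = 1
    return options


print(sampler_options("win32"))
print(sampler_options("linux"))
```

Inside a model context this would be used as `trace = pm.sample(**sampler_options())`. Sampling is slower this way, since the chains no longer run in parallel.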

You don’t have any variables or likelihood in the code above, just the sampler call. Am I missing something?

Hi Leon

Yes, that is a known issue, mostly (if not exclusively) Windows-related.


However, in my experience the notebook tends to die when I call the sampler on a complex or mis-specified model. If you post the code for your model specification in this thread (as @luisroque asked), you might get some advice on what is causing the sampler to fail.


Thanks a lot! Sorry, it seems I accidentally copied only half the code… @luisroque you are totally right, there is obviously something missing. Here is the missing code:
(it is just a simple tutorial example I was working with before actually trying to fit a model to my data, link: https://docs.pymc.io/notebooks/getting_started.html)


basic_model = pm.Model()

with basic_model:

    # Priors for unknown model parameters
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # Expected value of outcome
    mu = alpha + beta[0] * X1 + beta[1] * X2

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)

I searched a lot but didn’t really find a way around this problem on Windows. I’ve read somewhere that it should work on Ubuntu, but to be honest I would really like to avoid working with Ubuntu. Do you have any ideas how I could bypass this problem on Windows and get it running?

Hi Leon,

I ran your example on:

Windows 10
PyMC3 v3.9.3

Calling pm.sample() with the default options works like a charm with sampling being very fast and using all my cores (4 chains on 4 cores).

However, when I call pm.sample(init='advi') as in your code, I first get the following warning (which I believe is not relevant to the current situation, but I include it for completeness):

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
C:\Users\xxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gpuarray\dnn.py:184: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to a version >= v5 and <= v7.
  warnings.warn("Your cuDNN version is more recent than "

Then advi stops at 6% with:

Convergence achieved at 13200
Interrupted at 13,199 [6%]: Average Loss = 235.24
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta, alpha]

After that I get an [Errno 32] Broken pipe error, which likely signals that something went astray with multiprocessing. Indeed, if I look at the terminal I can see this beast:

 Process worker_chain_0:
Traceback (most recent call last):
  File "C:\Users\xxx\miniconda3\envs\workshop_env\lib\site-packages\pymc3\parallel_sampling.py", line 114, in _unpickle_step_method
    self._step_method = pickle.loads(self._step_method)
  File "C:\Users\xxx\miniconda3\envs\workshop_env\lib\site-packages\theano\compile\function_module.py", line 1082, in _constructor_Function
    f = maker.create(input_storage, trustme=True)
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\compile\function_module.py", line 1715, in create
    input_storage=input_storage_lists, storage_map=storage_map)
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\link.py", line 699, in make_thunk
    storage_map=storage_map)[:3]
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\vm.py", line 1091, in make_all
    impl=impl))
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\op.py", line 955, in make_thunk
    no_recycling)
  File "C:\Users\xxxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\op.py", line 858, in make_c_thunk
    output_storage=node_output_storage)
  File "C:\Users\xxxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\cc.py", line 1217, in make_thunk
    keep_lock=keep_lock)
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\cc.py", line 1157, in __compile__
    keep_lock=keep_lock)
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\cc.py", line 1624, in cthunk_factory
    key=key, lnk=self, keep_lock=keep_lock)
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\cmodule.py", line 1189, in module_from_key
    module = lnk.compile_cmodule(location)
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\cc.py", line 1527, in compile_cmodule
    preargs=preargs)
  File "C:\Users\xxxxxx\miniconda3\envs\workshop_env\lib\site-packages\theano\gof\cmodule.py", line 2399, in compile_str
    (status, compile_stderr.replace('\n', '. ')))
Exception: ('The following error happened while compiling the node', Reshape{0}(Subtensor{int64:int64:}.0, TensorConstant{[]}), '\n', 'Compilation failed (return status=3): ', '[Reshape{0}(<TensorType(float64, vector)>, TensorConstant{[]})]')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\pymc3\parallel_sampling.py", line 135, in run
    self._unpickle_step_method()
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\pymc3\parallel_sampling.py", line 116, in _unpickle_step_method
    raise ValueError(unpickle_error)
ValueError: The model could not be unpickled. This is required for sampling with more than one core and multiprocessing context spawn or forkserver.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\xxxxxx\miniconda3\envs\workshop_env\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\xxxxxx\miniconda3\envs\workshop_env\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\pymc3\parallel_sampling.py", line 232, in _run_process
    _Process(*args).run()
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\pymc3\parallel_sampling.py", line 145, in run
    self._wait_for_abortion()
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\site-packages\pymc3\parallel_sampling.py", line 151, in _wait_for_abortion
    msg = self._recv_msg()
  File "C:\Users\xxxxxx\miniconda3\envs\workshop_env\lib\site-packages\pymc3\parallel_sampling.py", line 169, in _recv_msg
    return self._msg_pipe.recv()
  File "C:\Users\xxxxx\miniconda3\envs\workshop_env\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "C:\Users\xxxxxx\miniconda3\envs\workshop_env\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError

I am by no stretch of the imagination knowledgeable about the PyMC3 backend, but it seems quite evident that the first chain failed, causing a problem with one of the workers, which in turn led to the multiprocessing error (multiprocessing is known to be a bit awkward on Windows, and in IPython environments specifically).
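The ValueError in the traceback says that the model has to be unpickled in each worker when the multiprocessing context is spawn or forkserver. As a rough illustration of that requirement (plain Python, no PyMC3 involved), the sketch below checks whether an object survives a pickle round trip in the current process. `survives_pickle` is a hypothetical helper, and passing this check does not guarantee that unpickling also succeeds in a worker: in the traceback above the failure happened while Theano recompiled a function during unpickling.

```python
import multiprocessing as mp
import pickle


def survives_pickle(obj):
    """Return True if obj can be pickled and unpickled in this process."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False


# On Windows the only available start method is "spawn", which sends
# objects to the worker processes by pickling them.
print("start method:", mp.get_start_method())
print("dict survives:", survives_pickle({"alpha": 1.0}))
print("lambda survives:", survives_pickle(lambda x: x + 1))
```

A lambda is a common example of an object that the standard pickle module cannot serialize, which is why it fails here.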

THE ACTUAL ANSWER
My gut feeling is that the initialization generated by advi created some sort of instability that led one of the chains to fail. I know there have been problems with the choice of init before (see “Initialization energy is NaN or Inf with jitter”). As a test, try using the default options.
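Such a test could look roughly like the sketch below, which puts the data simulation and the model from this thread into two functions. The function names `simulate_data` and `fit` are made up for illustration, and PyMC3 is imported inside `fit` so that the data step can be run on its own.

```python
import numpy as np


def simulate_data(size=100, seed=8927):
    """Simulate the two predictors and the outcome from the first post."""
    rng = np.random.RandomState(seed)
    x1 = rng.randn(size)
    x2 = rng.randn(size) * 0.2
    # True parameters: alpha = 1, beta = [1, 2.5], sigma = 1
    y = 1 + 1 * x1 + 2.5 * x2 + rng.randn(size) * 1
    return x1, x2, y


def fit(x1, x2, y, draws=500, **sample_kwargs):
    """Build the linear model and sample; extra kwargs go to pm.sample."""
    import pymc3 as pm  # imported here so simulate_data works without PyMC3

    with pm.Model():
        alpha = pm.Normal("alpha", mu=0, sigma=10)
        beta = pm.Normal("beta", mu=0, sigma=10, shape=2)
        sigma = pm.HalfNormal("sigma", sigma=1)
        mu = alpha + beta[0] * x1 + beta[1] * x2
        pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=y)
        return pm.sample(draws, **sample_kwargs)
```

Calling `fit(*simulate_data())` uses the default initialization, while `fit(*simulate_data(), init="advi")` reproduces the setting that crashed here.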

Unless I am completely off-track I believe you could get some extra insight from the PyMC3 team.


Hey, thank you so much! Appreciate it!
Just as you said, it works fine with the default options but always crashes with init=‘advi’. It even worked when I called init=‘nuts’, so the problem seems to be specific to ‘advi’. ‘advi’ was convenient to use because it converged quickly, but it seems it is not suitable for Windows. You said I might get further help from the PyMC3 team; do you know how to get in contact with them?

Hi Leon,

If you can’t find an existing issue that matches your problem, try opening one on GitHub. Problems with advi seem to pop up quite frequently in discussions.

Here OP used advi

Hi Valerio,
Thank you for all your help! I have tried a lot, but it seems to be a fundamental problem with the ‘advi’ initialization on Windows. It is apparently a long-known problem, so I hope they’ll fix it in the next update. Until then I will simply use the default settings :smiley: