206 samples :: Mass matrix contains zeros on the diagonal

Hi team,
long time listener, first time caller, love your work - thank you!

I have been installing PyMC3 on some new Pi 4s and have run into a very repeatable problem.

I have models I built and ran successfully on a Pi 3B+ previously, and without any changes I tried to use them as the test case for setting up my new boards. They get to 206 samples and return the ominous ‘Mass matrix contains zeros on the diagonal’. I read up on the error and saw plenty of discussion around the model definition, but it struck me as odd since I haven’t changed anything since last running these models successfully. I then picked up some ‘hello world’ type examples from the PyMC3 pages and hit the exact same error at the exact same location: 206 samples, all variables, every time, regardless of the model.

Hence my belief that I might be running into a library or compatibility issue, as the common difference from the existing setup is that the failing cases are all on a fresh Debian Buster with newly installed pymc3, numpy, and theano. So I suspect (I have since upgraded the old board as well) that I was previously on an older numpy version or something along those lines.

I am hoping I have overlooked something simple in the installation. Anyway, here is the problem as cleanly as I can present it, to see what you might know of it:

N.b. everything has been installed with ‘sudo pip3 install…’, which I point out because I saw some other discussion about ‘master’ vs ‘stable’. The following is in a freshly rebooted session, with nothing else running on the machine.

pi@pi4-1:~ $ python3
Python 3.7.3 (default, Dec 20 2019, 18:57:59) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pymc3 as pm
>>> pm.__version__
'3.8'
>>> import numpy as np
>>> np.__version__
'1.18.4'
>>> import theano
>>> theano.__version__
'1.0.4'
>>> 
>>> size = 100
>>> true_intercept = 1
>>> true_slope = 2
>>> 
>>> x = np.linspace(0, 1, size)
>>> 
>>> # y = a + b*x
... true_regression_line = true_intercept + true_slope * x
>>> 
>>> # add noise
... y = true_regression_line + np.random.normal(scale=.5, size=size)
>>> 
>>> # Add outliers
... x_out = np.append(x, [.1, .15, .2])
>>> y_out = np.append(y, [8, 6, 9])
>>> 
>>> data = dict(x=x_out, y=y_out)
>>> 
>>> with pm.Model() as model:
...     pm.glm.GLM.from_formula('y ~ x', data)
...     trace = pm.sample(2000, cores=2)
... 
<pymc3.glm.linear.GLM object at 0xb08a4eb0>
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sd, x, Intercept]
Sampling 2 chains, 0 divergences:   4%|▌             | 206/5000 [00:15<05:57, 13.42draws/s]
pymc3.parallel_sampling.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 110, in run
    self._start_loop()
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
    point, stats = self._compute_point()
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/hmc/base_hmc.py", line 130, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/hmc/quadpotential.py", line 231, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `sd_log__`.ravel()[0] is zero.
The derivative of RV `x`.ravel()[0] is zero.
The derivative of RV `Intercept`.ravel()[0] is zero.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 110, in run
    self._start_loop()
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
    point, stats = self._compute_point()
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/hmc/base_hmc.py", line 130, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/hmc/quadpotential.py", line 231, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `sd_log__`.ravel()[0] is zero.
The derivative of RV `x`.ravel()[0] is zero.
The derivative of RV `Intercept`.ravel()[0] is zero.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/local/lib/python3.7/dist-packages/pymc3/sampling.py", line 469, in sample
    trace = _mp_sample(**sample_args)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/sampling.py", line 1059, in _mp_sample
    for draw in sampler:
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 394, in __iter__
    draw = ProcessAdapter.recv_draw(self._active)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 297, in recv_draw
    raise error from old_error
RuntimeError: Chain 0 failed.

As I said, it’s very repeatable across the environments I’m testing: Pi 3B+, Pi 4, new models as well as old, complex as well as simple. All stop right at 206.
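For anyone landing here from a search, here is a rough plain-Python sketch of why a stuck chain produces this message. Tuning schemes like `adapt_diag` estimate the diagonal of the mass matrix from the variance of each variable's tuning draws, so a chain whose gradients are always zero (and which therefore never moves) puts a zero on that diagonal. This is an illustration of the idea only, not PyMC3's actual adaptation code:

```python
from statistics import pvariance

# adapt_diag-style tuning estimates the mass-matrix diagonal from the
# variance of each variable's tuning draws.
stuck_chain = [0.0] * 50                      # zero gradient: the chain never moves
healthy_chain = [0.1, -0.3, 0.2, 0.05, -0.15]

print(pvariance(stuck_chain))    # 0.0 -> a zero lands on the mass-matrix diagonal
print(pvariance(healthy_chain))  # positive -> a usable diagonal entry
```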

Hi Peter,
So, to be clear: your models were running on an older version of PyMC, and now they break on PyMC 3.8 (stable, not master)?

They were running on a Pi 3B+ with Stretch, with (I think; I no longer have it) PyMC3 3.7, NumPy 1.17, Theano 1.0.4. Now I’m trying a new setup on a Pi 4 with Buster: PyMC3 3.8, NumPy 1.18, Theano 1.0.4. The Pi 3B+ has since been wiped and refreshed and is now also running Buster, with new Python, PyMC3, NumPy, and Theano.

As a little test before replying here, I tried all combinations of PyMC3 [3.6, 3.7, 3.8] x NumPy [1.16, 1.17, 1.18] x Theano [1.0.4]. The printed version numbers matched the changing installations, but every combination gives the exact same error at precisely the same point in sampling.

As these are newly set up machines, it’s possible I’m missing some dependencies somewhere, but I’m not seeing any errors or warnings to that effect. Obviously pip handles the Python dependencies, but could there be something on the compile side (flags or libraries) that is out of place?
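One cheap sanity check along those lines, using only the standard library (the tool names below are the usual Debian ones; adjust as needed): Theano needs a working C/C++ toolchain and the Python headers to compile its ops, and a missing compiler can degrade behaviour silently rather than erroring out.

```python
import shutil
import sysconfig

# Are the compilers Theano relies on actually on the PATH?
for tool in ("gcc", "g++", "gfortran"):
    print(tool, "->", shutil.which(tool))  # None means not found

# Are the Python headers available (python3-dev on Debian)?
print("include dir:", sysconfig.get_paths()["include"])
```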

More exploring this afternoon: I have been able to run models using the Metropolis and DEMetropolis steps. NUTS being the default is obviously the initial issue, and using HMC directly also yields the same error at 206.
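That pattern fits a gradient problem: Metropolis-family steps only evaluate the log-density, while NUTS and HMC also need its derivatives, which are exactly what the error reports as zero. For illustration, here is a minimal stdlib random-walk Metropolis on a standard normal target (a sketch of the general algorithm, not PyMC3's implementation):

```python
import math
import random

def log_density(x):
    # Standard normal log-density up to an additive constant.
    return -0.5 * x * x

def random_walk_metropolis(n_draws, scale=1.0, seed=42):
    rng = random.Random(seed)
    x = 0.0
    draws = []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, scale)
        # Accept with probability min(1, p(proposal) / p(x)); note that
        # only density *values* are needed, never derivatives.
        if math.log(rng.random()) < log_density(proposal) - log_density(x):
            x = proposal
        draws.append(x)
    return draws

draws = random_walk_metropolis(5000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 2), round(var, 2))  # should land near 0 and 1
```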

I’m also going to try downgrading gcc to whatever version I was on previously.

Can you try changing the init method, i.e. pm.sample(init="adapt_diag")?
The GLM module’s priors are very wide, and that could be the cause of your error.

I have tried that now, and also simplified the priors, and I get the same problem:

>>> with pm.Model() as model:
...     mu = pm.Normal('mu', mu=0, sigma=1)
...     obs = pm.Normal('obs', mu=mu, sigma=1, observed=np.random.randn(100))
...     trace = pm.sample(1000, tune=500, init="adapt_diag")
... 
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu]
Sampling 2 chains, 0 divergences:   7%|████▎                                                          | 206/3000 [00:10<02:20, 19.90draws/s]
pymc3.parallel_sampling.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 110, in run
    self._start_loop()
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
    point, stats = self._compute_point()
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/hmc/base_hmc.py", line 130, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/hmc/quadpotential.py", line 231, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `mu`.ravel()[0] is zero.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 110, in run
    self._start_loop()
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
    point, stats = self._compute_point()
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/hmc/base_hmc.py", line 130, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/step_methods/hmc/quadpotential.py", line 231, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `mu`.ravel()[0] is zero.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/local/lib/python3.7/dist-packages/pymc3/sampling.py", line 469, in sample
    trace = _mp_sample(**sample_args)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/sampling.py", line 1059, in _mp_sample
    for draw in sampler:
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 394, in __iter__
    draw = ProcessAdapter.recv_draw(self._active)
  File "/usr/local/lib/python3.7/dist-packages/pymc3/parallel_sampling.py", line 297, in recv_draw
    raise error from old_error
RuntimeError: Chain 1 failed.

I thought this gave an interesting insight into what is going wrong. Not sure if this is obvious or not, but it seemed informative to me:

>>> from theano.printing import Print as tt_print
>>> with pm.Model() as model:
...     mu = pm.Normal('mu', mu=0, sigma=1)
...     mu_p = tt_print('mu')(mu)
...     obs = pm.Normal('obs', mu=mu_p, sigma=1, observed=np.random.randn(100))
...     trace = pm.sample(1000,tune=500, init="adapt_diag")
... 
mu __str__ = 0.0
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
mu __str__ = 0.0
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu]
Sampling 2 chains, 0 divergences:   0%|                                                                         | 0/3000 [00:00<?, ?draws/s]mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
mu __str__ = 0.0
etc., etc., etc.

I’ve found this question with similar symptoms, and its suggestion of init='advi' has worked for me:

>>> from theano.printing import Print as tt_print
>>> with pm.Model() as model:
...     mu = pm.Normal('mu', mu=0, sigma=1)
...     # mu_p = tt_print('mu')(mu)
...     obs = pm.Normal('obs', mu=mu, sigma=1, observed=np.random.randn(100))
...     trace = pm.sample(1000,tune=500, init="advi")
... 
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average Loss = 142.29:   4%|██▉                                                                     | 8299/200000 [00:02<01:03, 3018.93it/s]
Convergence achieved at 8300
Interrupted at 8,299 [4%]: Average Loss = 148.34
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu]
Sampling 2 chains, 0 divergences:  35%|█████████████████████▌                                        | 1043/3000 [02:15<05:29,  5.94draws/s]

Ok, so it seems like it was an initialization problem, as I suspected. What’s weird, though, is that I just ran this model smoothly on the master branch.
@aseyboldt, could this be related to the issue I talked to you about last week? :man_shrugging:

I unmarked the solution, as I found advi was also buggy. It initialises with values, but sampling then never changes them, so it ‘passes’ although I get an acceptance probability of 1 and non-meaningful results with n spikes, where n is the number of chains initialised at different values.

I tried a few other inits using this simple test:

import pymc3 as pm
pm.__version__
import numpy as np
np.__version__
import theano
theano.__version__
from theano.printing import Print as tt_print
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sigma=1)
    mu_p = tt_print('mu')(mu)
    obs = pm.Normal('obs', mu=mu_p, sigma=1, observed=np.random.randn(100))
    trace = pm.sample(1000,tune=500, init=<choices>)

And I get:
adapt_diag: fail; mu is all zeros.
jitter+adapt_diag: fail; mu alternates between two recurring values (two chains?).
advi+adapt_diag: ADVI succeeds; sampling fails; mu alternates between two recurring values.
advi+adapt_diag_grad: same as advi+adapt_diag.
advi: ADVI succeeds; sampling passes, but mu alternates between two values, so the trace is not meaningful (acceptance probability == 1).
advi_map: as per advi.
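The quick check I use to label a trace as ‘stuck’ or ‘alternating’ is just counting distinct draw values; `distinct_values` below is a hypothetical helper for illustration, not a PyMC3 function:

```python
def distinct_values(draws, tol=1e-12):
    """Count draws that differ by more than tol; healthy chains have many."""
    seen = []
    for d in draws:
        if not any(abs(d - s) < tol for s in seen):
            seen.append(d)
    return len(seen)

print(distinct_values([0.0] * 100))       # 1: the adapt_diag symptom
print(distinct_values([0.3, -0.2] * 50))  # 2: the alternating symptom
```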

Agreed it may not be a branch problem, but is it possible something relied upon is silently unavailable for me? These being newly set up machines, I may be missing some basic libraries that haven’t revealed themselves as missing through other activities yet.

Is there something I could include in the test sampling to get more verbose debug output?

TBH, I think I have exhausted my recommendations here :confused:
I tend to think it’s related to the machine you’re using, in the sense that I can run this model with the default init without issues on my machine. I’m afraid I’ll have to defer to others for more specific advice, however.

No problem, thanks for your help so far!

I’ve just attempted a completely clean start: from a new Debian Buster image I installed nothing but the required libraries and then pymc3, and I get the same error. :thinking:

Damn! Maybe @canyon289 or @lucianopaz will have an idea of what’s going on.

From what I’ve seen, it looks like you are installing everything with pip into the system Python. Did you try using conda for the installation? Conda usually pulls in all the C-compiled libraries that Theano needs to work. Another way to make this reproducible would be to set up a Docker image, based on either Debian Buster or Debian Stretch, and try to install everything into the system Python in there.


Hi guys,
I just got a new Pi 4 8 GB, installed 64-bit Raspbian, and attempted pymc3 as the first task after making the compiler toolchain and the netcdf and h5py libraries available.

I get the exact same error as above.

Conda doesn’t seem to be available for arm64 yet, and I’m not keen on installing layers of systems. I have tried conda without success, and Docker seems like a couple of hours of setup, plus changes to my environment that will outlive the test, all to potentially reach the same result.

This is 100% reproducible for me: following https://docs.pymc.io/ straight out of the box gets me this error without fail (pun intended).

Thanks,
Peter

@Peter_James Apologies for not seeing this earlier.

To be frank, computational libraries on ARM architectures have always been hit or miss for me. It’s been suggested above, but I would try Docker. I’m surprised Docker would take a couple of hours to set up on a Linux machine, though. What is the holdup in that process?