Chain failure causing PyMC3 to error out


#1

I am building a family of models for logic gates implemented in biological cells. I have 6 models, one per gate (AND, OR, NOT, NAND, XOR, and XNOR). My models for all but XNOR work fine. However, when I try to train the XNOR models (training here means learning the actual continuous output response, starting from a prior that captures the intended behavior), I get a chain failure that causes PyMC3 to error out. Here’s the backtrace; I’m afraid I can’t translate it into guidance for figuring out what went wrong and how to fix it:

pymc3.parallel_sampling.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 73, in run
    self._start_loop()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 113, in _start_loop
    point, stats = self._compute_point()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 139, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 115, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 201, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `sigma_Input11_log__`.ravel()[0] is zero.
The derivative of RV `sigma_Input11_log__`.ravel()[1] is zero.
The derivative of RV `sigma_Input11_log__`.ravel()[2] is zero.
The derivative of RV `mu_Input11_lowerbound__`.ravel()[0] is zero.
The derivative of RV `mu_Input11_lowerbound__`.ravel()[1] is zero.
The derivative of RV `mu_Input11_lowerbound__`.ravel()[2] is zero.
The derivative of RV `sigma_Input10_log__`.ravel()[0] is zero.
The derivative of RV `sigma_Input10_log__`.ravel()[1] is zero.
The derivative of RV `sigma_Input10_log__`.ravel()[2] is zero.
The derivative of RV `mu_Input10_lowerbound__`.ravel()[0] is zero.
The derivative of RV `mu_Input10_lowerbound__`.ravel()[1] is zero.
The derivative of RV `mu_Input10_lowerbound__`.ravel()[2] is zero.
The derivative of RV `sigma_Input01_log__`.ravel()[0] is zero.
The derivative of RV `sigma_Input01_log__`.ravel()[1] is zero.
The derivative of RV `sigma_Input01_log__`.ravel()[2] is zero.
The derivative of RV `mu_Input01_lowerbound__`.ravel()[0] is zero.
The derivative of RV `mu_Input01_lowerbound__`.ravel()[1] is zero.
The derivative of RV `mu_Input01_lowerbound__`.ravel()[2] is zero.
The derivative of RV `sigma_Input00_log__`.ravel()[0] is zero.
The derivative of RV `sigma_Input00_log__`.ravel()[1] is zero.
The derivative of RV `sigma_Input00_log__`.ravel()[2] is zero.
The derivative of RV `mu_Input00_lowerbound__`.ravel()[0] is zero.
The derivative of RV `mu_Input00_lowerbound__`.ravel()[1] is zero.
The derivative of RV `mu_Input00_lowerbound__`.ravel()[2] is zero.
The derivative of RV `hyper_mu_mu_Input00_lowerbound__`.ravel()[0] is zero.
"""

The above exception was the direct cause of the following exception:

ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `sigma_Input11_log__`.ravel()[0] is zero.
The derivative of RV `sigma_Input11_log__`.ravel()[1] is zero.
The derivative of RV `sigma_Input11_log__`.ravel()[2] is zero.
The derivative of RV `mu_Input11_lowerbound__`.ravel()[0] is zero.
The derivative of RV `mu_Input11_lowerbound__`.ravel()[1] is zero.
The derivative of RV `mu_Input11_lowerbound__`.ravel()[2] is zero.
The derivative of RV `sigma_Input10_log__`.ravel()[0] is zero.
The derivative of RV `sigma_Input10_log__`.ravel()[1] is zero.
The derivative of RV `sigma_Input10_log__`.ravel()[2] is zero.
The derivative of RV `mu_Input10_lowerbound__`.ravel()[0] is zero.
The derivative of RV `mu_Input10_lowerbound__`.ravel()[1] is zero.
The derivative of RV `mu_Input10_lowerbound__`.ravel()[2] is zero.
The derivative of RV `sigma_Input01_log__`.ravel()[0] is zero.
The derivative of RV `sigma_Input01_log__`.ravel()[1] is zero.
The derivative of RV `sigma_Input01_log__`.ravel()[2] is zero.
The derivative of RV `mu_Input01_lowerbound__`.ravel()[0] is zero.
The derivative of RV `mu_Input01_lowerbound__`.ravel()[1] is zero.
The derivative of RV `mu_Input01_lowerbound__`.ravel()[2] is zero.
The derivative of RV `sigma_Input00_log__`.ravel()[0] is zero.
The derivative of RV `sigma_Input00_log__`.ravel()[1] is zero.
The derivative of RV `sigma_Input00_log__`.ravel()[2] is zero.
The derivative of RV `mu_Input00_lowerbound__`.ravel()[0] is zero.
The derivative of RV `mu_Input00_lowerbound__`.ravel()[1] is zero.
The derivative of RV `mu_Input00_lowerbound__`.ravel()[2] is zero.
The derivative of RV `hyper_mu_mu_Input00_lowerbound__`.ravel()[0] is zero.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "Three_Layer_Analysis.py", line 309, in <module>
    main()
  File "Three_Layer_Analysis.py", line 304, in main
    (entropy, model) = do_main(gate, trace_name=trace_name)
  File "Three_Layer_Analysis.py", line 281, in do_main
    cores=cores, tune=1000)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/sampling.py", line 440, in sample
    trace = _mp_sample(**sample_args)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/sampling.py", line 990, in _mp_sample
    for draw in sampler:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 305, in __iter__
    draw = ProcessAdapter.recv_draw(self._active)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 223, in recv_draw
    six.raise_from(RuntimeError('Chain %s failed.' % proc.chain), old)
  File "<string>", line 3, in raise_from
RuntimeError: Chain 3 failed.

One initial question: is a single failed chain a fatal error, or can this just happen every now and then? Should I try to work around it by recovering from the error (perhaps by discarding the failing chain)? Or does it indicate a major problem in the parameterization of the model? As you can see from the traceback, a number of bounded variables are involved (normals constrained to be greater than zero).


#2

Generally speaking, it indicates a problem with your model or parameterization. It doesn’t just happen randomly! As such, you shouldn’t simply discard a chain when it fails: a failing chain tells you to consider reparameterizing your model, or even respecifying it entirely.
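
To give a rough picture of what reparameterizing can mean, here is a minimal sketch of the non-centered form that is often tried for hierarchical normals. Whether it applies to your model is an assumption on my part, since I haven't seen your code. The names `hyper_mu`, `sigma`, `centered_draws`, and `non_centered_draws` are hypothetical, the sketch uses only the Python standard library rather than PyMC3, and it ignores the lower bound on your `mu_*` variables, so treat it as an illustration of the idea and not as a fix.

```python
import random
import statistics


def centered_draws(hyper_mu, sigma, n, seed=0):
    """Centered form: draw mu ~ Normal(hyper_mu, sigma) directly."""
    rng = random.Random(seed)
    return [rng.gauss(hyper_mu, sigma) for _ in range(n)]


def non_centered_draws(hyper_mu, sigma, n, seed=0):
    """Non-centered form: draw z ~ Normal(0, 1), then shift and scale.

    mu = hyper_mu + sigma * z has the same distribution as the centered
    draw, but the random quantity z no longer depends on sigma.
    """
    rng = random.Random(seed)
    return [hyper_mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    hyper_mu, sigma, n = 2.0, 0.5, 20000

    centered = centered_draws(hyper_mu, sigma, n)
    non_centered = non_centered_draws(hyper_mu, sigma, n)

    # Both forms should show roughly the same mean and spread.
    print(statistics.mean(centered), statistics.stdev(centered))
    print(statistics.mean(non_centered), statistics.stdev(non_centered))
```

In a sampler, the point of the second form is that the sampled quantity is `z`, whose scale stays fixed while `sigma` changes, which can make the posterior geometry easier to explore. With bounded variables like yours the transformation is not this simple, so check it against your actual model before relying on it.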

At the expense of some shameless self-promotion, I wrote a small cookbook on Bayesian modelling. I think you’ll find the last two sections, on MCMC diagnostics and model diagnostics, helpful here.