Chain failure with an exponential model

Hi community,
I am trying to model 5 dependent variables (Observed mean response time comp, Observed mean response time incomp, Predicted mean decision time comp, Predicted mean decision time incomp, Predicted mean nondecision time) using an exponential model with the decay rate (beta) constrained to be identical between dependent variables and priors based on theory :

with pm.Model() as Model_1:

    α_meanRT_c = pm.HalfCauchy('α_meanRT_c',2)  
    α_meanRT_i = pm.HalfCauchy('α_meanRT_i', 2) 
    α_meanDT_c = pm.HalfCauchy('α_meanDT_c',2)  
    α_meanDT_i = pm.HalfCauchy('α_meanDT_i', 2)     
    
    α_meanTer = pm.HalfCauchy('α_meanTer', 2)      
 
    β = pm.Normal('β', 0.334, 0.1)#Kail (1991)
    
    γ_meanRT_c = pm.Normal('γ_meanRT_c', Mean_RT_comp_General[-1], .1)
    γ_meanRT_i = pm.Normal('γ_meanRT_i', Mean_RT_incomp_General[-1], .1)    
    γ_meanDT_c = pm.Normal('γ_meanDT_c', pred_meanDT_comp[-1], .1)
    γ_meanDT_i = pm.Normal('γ_meanDT_i', pred_meanDT_incomp[-1], .1)     
    γ_meanTer = pm.Normal('γ_meanTer', mean_param_per_group_model[-1, 2], .1)  

    σ_meanRT_c = pm.HalfNormal('σ_meanRT_c', .1)
    σ_meanRT_i = pm.HalfNormal('σ_meanRT_i', .1)
    σ_meanDT_c = pm.HalfNormal('σ_meanDT_c', .1)
    σ_meanDT_i = pm.HalfNormal('σ_meanDT_i', .1)    
    σ_meanTer = pm.HalfNormal('σ_meanTer', .1)     
    
    μ_meanRT_c = α_meanRT_c*pm.math.exp(-β*Age_General) + γ_meanRT_c
    μ_meanRT_i = α_meanRT_i*pm.math.exp(-β*Age_General) + γ_meanRT_i      
    μ_meanDT_c = α_meanDT_c*pm.math.exp(-β*Age_General) + γ_meanDT_c
    μ_meanDT_i = α_meanDT_i*pm.math.exp(-β*Age_General) + γ_meanDT_i    
    μ_meanTer = α_meanTer*pm.math.exp(-β*Age_General) + γ_meanTer        
   
    y_meanRT_c = pm.Normal('y_meanRT_c', μ_meanRT_c, σ_meanRT_c, observed=Mean_RT_comp_General)
    y_meanRT_i = pm.Normal('y_meanRT_i', μ_meanRT_i, σ_meanRT_i, observed=Mean_RT_incomp_General)  
    y_meanDT_c = pm.Normal('y_meanDT_c', μ_meanDT_c, σ_meanDT_c, observed=pred_meanDT_comp)
    y_meanDT_i = pm.Normal('y_meanDT_i', μ_meanDT_i, σ_meanDT_i, observed=pred_meanDT_incomp)      
    
    trace_Model_1 = pm.sample (2000, chains = 4, cores = 4, tune = 2000, target_accept = .95)

I frequently (although not systematically) get chain failures, for example :

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 110, in run
    self._start_loop()
  File "/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
    point, stats = self._compute_point()
  File "/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 130, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 231, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `σ_meanTer_log__`.ravel()[0] is zero.
The derivative of RV `σ_meanDT_i_log__`.ravel()[0] is zero.
The derivative of RV `σ_meanDT_c_log__`.ravel()[0] is zero.
The derivative of RV `σ_meanRT_i_log__`.ravel()[0] is zero.
The derivative of RV `σ_meanRT_c_log__`.ravel()[0] is zero.
The derivative of RV `γ_meanTer`.ravel()[0] is zero.
The derivative of RV `γ_meanDT_i`.ravel()[0] is zero.
The derivative of RV `γ_meanDT_c`.ravel()[0] is zero.
The derivative of RV `γ_meanRT_i`.ravel()[0] is zero.
The derivative of RV `γ_meanRT_c`.ravel()[0] is zero.
The derivative of RV `β`.ravel()[0] is zero.
The derivative of RV `α_meanTer_log__`.ravel()[0] is zero.
The derivative of RV `α_meanDT_i_log__`.ravel()[0] is zero.
The derivative of RV `α_meanDT_c_log__`.ravel()[0] is zero.
The derivative of RV `α_meanRT_i_log__`.ravel()[0] is zero.
The derivative of RV `α_meanRT_c_log__`.ravel()[0] is zero.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `σ_meanTer_log__`.ravel()[0] is zero.
The derivative of RV `σ_meanDT_i_log__`.ravel()[0] is zero.
The derivative of RV `σ_meanDT_c_log__`.ravel()[0] is zero.
The derivative of RV `σ_meanRT_i_log__`.ravel()[0] is zero.
The derivative of RV `σ_meanRT_c_log__`.ravel()[0] is zero.
The derivative of RV `γ_meanTer`.ravel()[0] is zero.
The derivative of RV `γ_meanDT_i`.ravel()[0] is zero.
The derivative of RV `γ_meanDT_c`.ravel()[0] is zero.
The derivative of RV `γ_meanRT_i`.ravel()[0] is zero.
The derivative of RV `γ_meanRT_c`.ravel()[0] is zero.
The derivative of RV `β`.ravel()[0] is zero.
The derivative of RV `α_meanTer_log__`.ravel()[0] is zero.
The derivative of RV `α_meanDT_i_log__`.ravel()[0] is zero.
The derivative of RV `α_meanDT_c_log__`.ravel()[0] is zero.
The derivative of RV `α_meanRT_i_log__`.ravel()[0] is zero.
The derivative of RV `α_meanRT_c_log__`.ravel()[0] is zero.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-32-5184d81c6fa2> in <module>()
     46     y_meanDT_i = pm.Normal('y_meanDT_i', μ_meanDT_i, σ_meanDT_i, observed=pred_meanDT_incomp)
     47 
---> 48     trace_Model_1 = pm.sample (2000, chains = 4, cores = 4, tune = 2000, target_accept = .95)
     49 
     50 folder = Path(r'/gpfs22/home/servanm/enfants/chains/dataset1/Model_1')

/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, **kwargs)
    435             _print_step_hierarchy(step)
    436             try:
--> 437                 trace = _mp_sample(**sample_args)
    438             except pickle.PickleError:
    439                 _log.warning("Could not pickle model, sampling singlethreaded.")

/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, **kwargs)
    967         try:
    968             with sampler:
--> 969                 for draw in sampler:
    970                     trace = traces[draw.chain - chain]
    971                     if (trace.supports_sampler_stats

/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/parallel_sampling.py in __iter__(self)
    391 
    392         while self._active:
--> 393             draw = ProcessAdapter.recv_draw(self._active)
    394             proc, is_last, draw, tuning, stats, warns = draw
    395             if self._progress is not None:

/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
    295             else:
    296                 error = RuntimeError("Chain %s failed." % proc.chain)
--> 297             raise error from old_error
    298         elif msg[0] == "writing_done":
    299             proc._readable = True

RuntimeError: Chain 0 failed.

I am a little stuck here. What is the cause of these failures and what should I do to fix the issue ?

The data, in case somebody would like to replicate the problem:

Age in years
[ 6 7 8 9 10 11 12 13 14 27]
Observed mean RT comp
[0.82029 0.81442 0.67686 0.60575 0.60129 0.5627 0.52159 0.49382 0.46922
0.38829]
Observed mean RT incomp
[0.88702 0.88427 0.74223 0.64513 0.63799 0.58742 0.55774 0.52565 0.51099
0.42094]
Predicted mean DT comp
[0.2945 0.38196 0.2287 0.18751 0.18723 0.17296 0.15984 0.12904 0.12739
0.10536]
Predicted mean DT incomp
[0.35579 0.43869 0.29425 0.22501 0.22126 0.19745 0.19284 0.16187 0.16684
0.14178]
Predicted mean nondecision time
[0.52991 0.4387 0.44892 0.4197 0.41642 0.3896 0.36394 0.36589 0.34392
0.2829 ]

Likely after the exponential transformation some values becomes too large, and the gradient at the tail is usually near 0.
You should:

  1. check the test_value to see if they are indeed too large, if so
  2. set more informative prior, and specify test value so they dont blow up initially.

Many thanks for the help. I called print Model_1.test_point just before calling pm.sample and got the following output (the chain crashed again) :

{'α_meanRT_c_log__': array(0.69315), 'α_meanRT_i_log__': array(0.69315), 'α_meanDT_c_log__': array(0.69315), 'α_meanDT_i_log__': array(0.69315), 'α_meanTer_log__': array(0.69315), 'β': array(0.334), 'γ_meanRT_c': array(0.38829), 'γ_meanRT_i': array(0.42094), 'γ_meanDT_c': array(0.10536), 'γ_meanDT_i': array(0.14178), 'γ_meanTer': array(0.2829), 'σ_meanRT_c_log__': array(-2.52838), 'σ_meanRT_i_log__': array(-2.52838), 'σ_meanDT_c_log__': array(-2.52838), 'σ_meanDT_i_log__': array(-2.52838), 'σ_meanTer_log__': array(-2.52838)}

Could you try the debugging step here: How to track a 'nan energy'?

Sure, below are the results of the debugging:

make sure all test_value are finite
Model_1.test_point
{'α_meanRT_c_log__': array(0.69315), 'α_meanRT_i_log__': array(0.69315), 'α_meanDT_c_log__': array(0.69315), 'α_meanDT_i_log__': array(0.69315), 'α_meanTer_log__': array(0.69315), 'β': array(0.334), 'γ_meanRT_c': array(0.38829), 'γ_meanRT_i': array(0.42094), 'γ_meanDT_c': array(0.10536), 'γ_meanDT_i': array(0.14178), 'γ_meanTer': array(0.2829), 'σ_meanRT_c_log__': array(-2.52838), 'σ_meanRT_i_log__': array(-2.52838), 'σ_meanDT_c_log__': array(-2.52838), 'σ_meanDT_i_log__': array(-2.52838), 'σ_meanTer_log__': array(-2.52838)}
make sure all logp are finite
make sure the potentials are all finite
p0
[-0.00411 -1.25549 -0.20598 -0.88218 -0.04209  1.26855 -2.24396 -0.9653
  0.01162 -0.09209 -1.3392   0.37139 -0.50212  1.05752  0.28284 -1.03444]
start.energy
-22.99133532589326
make sure model logp and its gradients are finite
logp
30.209137606795274
dlogp
[   0.36338   -7.32429   -7.32606   24.4479    17.45026    0.
   11.82309    2.85706  202.20002  183.56605 -377.3904     0.
    1.07376   -1.09739   26.09152   22.68059]
make sure velocity is finite
v
[-0.00411 -1.25549 -0.20598 -0.88218 -0.04209  1.26855 -2.24396 -0.9653
  0.01162 -0.09209 -1.3392   0.37139 -0.50212  1.05752  0.28284 -1.03444]
kinetic
7.217802280902015 

I don’t see non-finite elements. Similar to the post you mention, the problem appears to be solved by removing the jitter from adapt_diag. Could you please explain why?

Glad you solved it. Yes the jitter sometimes could cause problem, especially if you have random variable in a small scale, jitter might make the proposal/test value invalid.

Ok Thanks. I just tried to compare the above model with a similar model that allows the decay parameter of the exponential function free to vary across my dependent variables. I got some warnings during LOO and WAIC computations:

/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/stats.py:558: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  ics.append((n, ic_func(t, m, pointwise=True)))
/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/stats.py:300: UserWarning: Estimated shape parameter of Pareto distribution is
        greater than 0.7 for one or more samples.
        You should consider using a more robust model, this is because
        importance sampling is less likely to work well if the marginal
        posterior and LOO posterior are very different. This is more likely to
        happen with a non-robust model and highly influential observations.
  happen with a non-robust model and highly influential observations.""")
/gpfs22/home/servanm/myenv/lib/python3.6/site-packages/pymc3/stats.py:219: UserWarning: For one or more samples the posterior variance of the
        log predictive densities exceeds 0.4. This could be indication of
        WAIC starting to fail see http://arxiv.org/abs/1507.04544 for details
        
  """)

Might this issue be related to the previous one ? How can it be fixed ?
Thanks again for the help, very much appreciated.

That’s not directly related to the previous error, but it is something to do with prior in some way. It is a bit more difficult to get rid of because you would need to play with the scale of your prior to see what works.