Mass matrix contains zeros on the diagonal, derivative of some RVs is 0, sampler is stuck, but dlogp at the test point seems fine

I guess I am a bit confused about what’s going on behind the scenes here. I am sampling from a fairly complicated model, and the sampler failed after 103 samples with the common “Mass matrix contains zeros on the diagonal” error.

So I played a bit with the hyperpriors to make them more informative, and I also tried a practically similar likelihood for the problematic part of the model, but the error won’t go away. To check what’s happening in more depth, I printed the value of the problematic RV (‘error_coeffs’) during sampling.

The problem is that it doesn’t change value, i.e. it’s stuck at the same point, which would make sense if the dlogp were zero. However, p0, start energy, logp, dlogp, v, and kinetic all look fine for the test value.

What I am confused about is how dlogp can be fine at the test value but then be 0 once the sampler actually starts sampling, and what could be causing this error? I am finding it hard to think of other ways to debug this. Thank you for the help!

P.S. The error happens with a different set of real data too, with the same priors, so it can’t be that the initial value is too close to the argmax of the probability density.

P.S.2 I am printing the value of error_coeffs with pm.Deterministic('error_print', tt.printing.Print('error')(error_coeffs)). How would I print the logp and the dlogp of this variable during sampling?
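
For the logp half of this, a minimal sketch of the same Print trick applied to the model’s joint log-probability (here tt is the theano module, as in the snippets below, and as far as I understand the Deterministic is only evaluated when a draw is recorded, so this prints once per stored sample rather than once per leapfrog step):

# sketch: assumes `model` is the pm.Model object and `tt` is the theano module
pm.Deterministic(
    'logp_print',
    tt.printing.Print('model_logp')(model.logpt)
)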

Update on this: after some digging, I haven’t solved the problem, but I have found out more about it. Since the error pointed at pop_alpha_sigma_log__ as the variable with zero derivative, I decided to print the derivative of the logp with respect to pop_alpha_sigma_log__ with:

pm.Deterministic('dlogp',
    tt.printing.Print('dlogp_wrt_pop_alpha_sigma_log')(
        tt.grad(
            model.logpt, model.vars[1],
            disconnected_inputs='warn')
    )
)

(I checked that the index 1 is pop_alpha_sigma_log__ by printing model.vars)
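
As a side note, a sketch of looking the variable up by name instead of by position, which should be less brittle if the model changes (this assumes PyMC3’s named_vars dict and the default _log__ naming of the transformed variable):

# sketch: look up the free (transformed) variable by name rather than by index into model.vars
target = model.named_vars['pop_alpha_sigma_log__']
pm.Deterministic(
    'dlogp_pop_alpha_sigma',
    tt.printing.Print('dlogp_wrt_pop_alpha_sigma_log')(
        tt.grad(model.logpt, target, disconnected_inputs='warn')
    )
)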

The problem is that the printed value is actually not zero, except for the very first print at the test point before sampling starts, where it is NaN.

By running

for var in model.vars:
    pm.Deterministic(f'dlogp_{var.name}',
        tt.printing.Print(f'dlogp_wrt_{var.name}')(
            tt.grad(model.logpt, var, disconnected_inputs='warn')
        )
    )

it turns out that in fact the dlogp is not 0 with respect to any of the variables.

Either the rabbit hole is deeper than I thought, or I am getting lost over something trivial!

I am still stuck on this, and I realized that maybe I didn’t ask a precise enough question here (sorry for the multiple posts, let me know if this is not okay). My question is: how can there be an error saying that the derivative of a certain RV is 0, when the derivative I print is actually not 0? Does it indicate a bug in the software, am I printing the wrong thing, or is this expected and I am missing something about what the error means?

Does the problem happen when the sampler starts? If so, setting init='adapt_diag' might help (the default init jitters the initial position, which sometimes causes zero-gradient problems).
Otherwise, if you are seeing the problem some time after the sampler has started, it is usually an indication of a problem with the priors: they contain too little information, which gives a flat geometry. More informative priors would help in those cases.
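
For reference, a minimal sketch of how this is passed to pm.sample (the tune and cores values are just placeholders):

# sketch: request the non-jittered initialization explicitly
with model:
    trace = pm.sample(init='adapt_diag', tune=1000, cores=1)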

Thank you for the help! I have tried both init='adapt_diag' and fitting with a very precise prior, but neither solved the problem. However, the model runs fine when I fit fewer participants, and when I fix the alphas vector parameter, leaving only error_coeffs to be fitted. Could it be an overflow error? Is there any way to check for that?
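
One relatively cheap check (a sketch, assuming the PyMC3 3.x Model API, i.e. check_test_point and fastfn, and that probability_accept is registered as a Deterministic as in the model posted below) is to look for -inf/NaN logp terms and for exact zeros or NaNs in the intermediate probabilities, which would be a symptom of underflow or overflow:

import numpy as np

# per-RV logp at the test point: any -inf or NaN here points at the culprit term
print(model.check_test_point())

# evaluate the intermediate acceptance probabilities at the test point;
# an exact zero at an observed category would give the Categorical a -inf logp
probs = model.fastfn(model.named_vars['probability_accept'])(model.test_point)
print(np.isnan(probs).any(), (probs == 0).any(), probs.min(), probs.max())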

What is also confusing me is why the error comes up even though the gradients of the logpt look fine. Could it have to do with the fact that the move proposed by the gradient takes the sampler to an ‘unreasonable’ part of the parameter space?

Looks like your model might be over-parameterized with respect to alpha; maybe you can share your model code.

The error occurred during the leapfrog integration, so it is not necessarily at the location where you are taking the gradient, but already a few leapfrog steps away.
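
To get a feel for whether the gradient degenerates a short distance away from the test point (not the actual leapfrog trajectory, just nearby points), here is a rough sketch; it assumes the PyMC3 3.x API in which model.dlogp() returns a compiled gradient function over the free parameters, and it mimics the uniform(-1, 1) jitter used by the default init:

import numpy as np

# sketch: probe the gradient at jittered copies of the test point
dlogp_fn = model.dlogp()  # compiled gradient of the joint logp w.r.t. all free parameters

for _ in range(20):
    point = {name: value + np.random.uniform(-1, 1, size=np.shape(value))
             for name, value in model.test_point.items()}
    grad = dlogp_fn(point)
    if not np.all(np.isfinite(grad)) or np.any(grad == 0):
        print('suspicious gradient at', point)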

Here is the model with a sample dataset (on which the model doesn’t fail, however, since it’s much smaller than the actual dataset). It’s quite complicated!

import numpy as np
import pymc3 as pm
import theano as tt  # note: here 'tt' is the theano module itself, not theano.tensor
import theano.tensor as T

def create_model(num_participants, num_states, possible_signals_array,
            real_signals_indices, types, picked_signals_indices,
            picsizes_values, participants_indices, states_values):

    with pm.Model() as model:

        types = tt.shared(types, name='types')

        # population-level hyperpriors for the per-participant alphas
        pop_alpha_mu = pm.HalfNormal(
            'pop_alpha_mu', sigma=3)
        pop_alpha_sigma = pm.HalfNormal(
            'pop_alpha_sigma', sigma=3)

        alphas = pm.TruncatedNormal(
            "alphas",
            mu=pop_alpha_mu,
            sigma=pop_alpha_sigma,
            lower=0,
            shape=num_participants
        )

        # population-level hyperpriors for the per-participant error coefficients
        error_coeffs_alpha = pm.HalfNormal(
            'pop_error_coeffs_alpha', sigma=1)
        error_coeffs_beta = pm.TruncatedNormal(
            'pop_error_coeffs_beta', mu=3, sigma=1.5, lower=0)

        # per-participant error coefficients in [0, 1]
        error_coeffs = pm.Beta(
            'error_coeffs',
            alpha=error_coeffs_alpha,
            beta=error_coeffs_beta,
            shape=num_participants
        )

        considered_signals = (types.dimshuffle(0, 'x') >= 
            types.dimshuffle('x', 0))
        min_picsize = min(picsizes_values)
        probability_accept = T.zeros(shape=(len(picsizes_values),
            len(real_signals_indices)))

        real_signals_indices = tt.shared(real_signals_indices,
            name='real_signals_indices')
        for index_state, state in enumerate(range(min_picsize, num_states)):

            local_poss_sig_array = tt.shared(
                possible_signals_array[state], name='local_poss_sig_array'
            )

            real_signals_array = local_poss_sig_array[real_signals_indices]
            unique_alt_profiles, index_signals_profile = T.extra_ops.Unique(
                axis=0, return_inverse=True)(considered_signals)
            unique_alt_profiles_expanded = T.tile(
                unique_alt_profiles[np.newaxis, :,:, np.newaxis],
                reps=(alphas.shape[0],1,1,local_poss_sig_array.shape[-1])
            )
            language_l = local_poss_sig_array / local_poss_sig_array.sum(
                axis=-1, keepdims=True)
            l0 = language_l[np.newaxis,:,:]

            l0_extended = T.tile(l0[:,np.newaxis,:,:],
                reps=(1, unique_alt_profiles.shape[0],1,1))
            
            unnorm_s1 = l0_extended ** alphas[
                :,np.newaxis, np.newaxis, np.newaxis]
            s1_unnorm = T.switch(unique_alt_profiles_expanded, unnorm_s1, 0)
            s1 = s1_unnorm / s1_unnorm.sum(axis=-2, keepdims=True)

            s1_filled = T.switch(unique_alt_profiles_expanded, s1, 1)
            l2 = s1_filled / s1_filled.sum(axis=-1, keepdims=True)

            language_possible = l2[:, index_signals_profile,
                T.arange(l2.shape[2]), :]
            l2_language = language_possible[:,real_signals_indices,:]
            
            unnorm_s3 = l2_language**alphas[:,np.newaxis,np.newaxis]
            s3 = unnorm_s3 / unnorm_s3.sum(axis=-2, keepdims=True)
    
            # mix the model prediction with a uniform distribution,
            # weighted by the per-participant error coefficient
            uniform_unnorm = T.ones_like(s3)
            uniform = uniform_unnorm / uniform_unnorm.sum(axis=-2, keepdims=True)
            error_params_expanded = error_coeffs[:,np.newaxis,np.newaxis]
            s3 = error_params_expanded*uniform+(1-error_params_expanded)*s3
    
            relevant_indices = (picsizes_values == state).nonzero()[0]
            subtensor = s3[
                participants_indices[relevant_indices],:,
                states_values[relevant_indices]]
            probability_accept = T.set_subtensor(
                probability_accept[relevant_indices],
                subtensor)

        # save the probability of acceptance
        pm.Deterministic("probability_accept", probability_accept)

        ### observed
        obs = pm.Categorical(
            "picked_signals", 
            p=probability_accept, 
            shape=len(picsizes_values), 
            observed=picked_signals_indices)

        step = pm.NUTS(target_accept=0.99)
        trace = pm.sample(step=step, cores=1)
    return trace


args = {
	'num_participants': 3,
	'num_states': 4,
	'possible_signals_array': [
	    np.array([[0],[0],[0], [1], [0], [1], [1], [1], [0], [0], [0], [1], [1], [0], [0], [1], [1]]),
	    np.array([[0, 1], [0, 1], [0, 1], [1, 1], [0, 1], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [1, 0]]),
	    np.array([[0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1], [1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0], [1, 1, 0]]),
	    np.array([[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 1], [1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1], [1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 1, 1], [0, 0, 0, 1], [1, 0, 0, 0], [1, 1, 1, 0]])
    ], 
	'real_signals_indices': np.array([0, 1, 2, 3, 4, 5, 6, 7, 8]), 
	'types': np.array([1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]), 
	'picked_signals_indices': np.array([4, 5, 6, 3, 2, 5, 3, 0, 0]), 
	'picsizes_values': np.array([1, 2, 2, 2, 3, 2, 2, 3, 3]), 
	'participants_indices': np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]), 
	'states_values': np.array([1, 2, 1, 1, 3, 0, 2, 3, 0])
 }

trace = create_model(**args)

I am wondering if there is any way to check whether alpha is over-parameterized? Usually I would check whether it shows random-walk-like behaviour in the trace, but in this case there is no trace to look at! Another question: is there any way to print the various logp, dlogp, etc. for the leapfrog steps? Thank you so much for the help, I really appreciate it!
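
On the “no trace to look at” part, one option is to draw samples one at a time and keep whatever was collected before the failure; a sketch, assuming pm.iter_sample (which yields the partial trace after each draw) and placeholder draw counts; this would go in place of the pm.sample call inside create_model:

# sketch: keep the partial trace when sampling dies with the mass-matrix error
with model:
    step = pm.NUTS(target_accept=0.99)
    partial_trace = None
    try:
        for partial_trace in pm.iter_sample(2000, step):  # 2000 draws is a placeholder
            pass
    except Exception as err:  # e.g. the "Mass matrix contains zeros on the diagonal" error
        print('sampling stopped early:', err)

if partial_trace is not None:
    print(len(partial_trace), 'draws kept; alphas shape:', partial_trace['alphas'].shape)

For the per-leapfrog logp and dlogp I do not know of a built-in hook; the closest do-it-yourself option is probably to compile model.logp and model.dlogp() once and evaluate them by hand at whatever points you want to probe, as in the gradient-probing sketch earlier in the thread.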