Help understanding why sampling fails in Zero-Inflated Gamma likelihood model

I have a hierarchical model in which the observation values range from 0 to 100 and the likelihood is Gamma distributed. To handle the occurrences of 0, I am using a zero-inflated mixture model in which a zero occurs with Bernoulli probability pi and the non-zero data are Gamma distributed. The distribution class is presented below.
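
In other words, the intended likelihood is: an observation is exactly 0 with probability pi, and otherwise a Gamma(alpha, beta) draw, so logp(y) = log(pi) for y = 0 and log(1 - pi) + GammaLogp(y | alpha, beta) for y > 0, which is what the logp method below implements.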

I am having trouble sampling from the posterior of this hierarchical model. There are three levels involved, including the observation level; to diagnose the issue, I have set weakly informative HalfNormal priors on the Gamma parameters at the two levels above the likelihood.

I am unable to initialize NUTS when I specify ADVI as the initialization method. Somewhere between 1 and 2,000 iterations, I receive a FloatingPointError stating that NaN occurred in optimization. I know this happens because the fit receives a value of infinity on the step just before the error, but I cannot figure out why.

With the default initialization (jitter+adapt_diag), I am able to sample from the posterior when I parameterize the Gamma in terms of alpha and beta. The resulting trace from 500 tuning samples and 1,500 draws is shown below:

The problem occurs when I switch to parameterizing the Gamma in terms of mu and sd. Halfway through tuning, the sampler throws the error “Bad initial energy: Model might be misspecified”. I don’t know how to diagnose this error, but I don’t believe the model is misspecified.

I need help figuring out why ADVI won’t work as the initialization method, and also why the mu/sd parameterization can’t be sampled. I would like to work with the mu/sd parameterization since it is easier to place prior beliefs on these values than on the alpha/beta parameters.

Test data is below. In this particular test data there aren’t any occurrences of 0, so I expect the pi values (the probability of 0 in the Bernoulli trials) to be essentially 0.

anonymized_test_data.csv (76.2 KB)

Here is the minimal example code needed to reproduce the issue.

import numpy as np
import pandas as pd
import theano.tensor as tt
import pymc3 as pm
from pymc3.distributions import distribution
from scipy import stats

data = pd.read_csv('anonymized_test_data.csv')

hierarchy = {
    'level_1': {
        'association': {'parent': 'network_group', 'child': 'network'}
    }, 
    'level_2': {
        'association': {'parent': 'network', 'child': 'network_program_id'}
    },
    'level_3': {
        'association': {'parent': 'network_program_id', 'child': 'observation'}
    } 
}

# Build a parent/child association table for each level of the hierarchy
for level in hierarchy:
    association = hierarchy[level]['association']
    parent = association['parent']
    child = association['child']
    subset = data[[parent, child]].drop_duplicates().reset_index(drop=True)
    subset = subset.rename(columns={parent: 'parent', child: 'child'})
    hierarchy[level]['data'] = subset
    
observed = pd.merge(data[['observation', 'y']], hierarchy['level_3']['data'], left_on='observation', right_on='child')

class ZeroInflatedGamma(pm.Continuous):
    """
    Zero-inflated Gamma log-likelihood.
    
    Parameters
    ----------
    pi : float
        Expected proportion of zero occurrences (0 < pi < 1).
    alpha : float
        Shape parameter (alpha > 0).
    beta : float
        Rate parameter (beta > 0).
    mu : float
        Alternative parameterization: mean (mu > 0).
    sd : float
        Alternative parameterization: standard deviation (sd > 0).
    """
    def __init__(self, alpha=None, beta=None, mu=None, sd=None, pi=None, *args, **kwargs):
        super(ZeroInflatedGamma, self).__init__(*args, **kwargs)
        alpha, beta = self.get_alpha_beta(alpha=alpha, beta=beta, mu=mu, sd=sd)
        self.alpha = alpha = tt.as_tensor_variable(alpha)
        self.beta = beta = tt.as_tensor_variable(beta)
        self.pi = pi = tt.as_tensor_variable(pi)
        self.gamma = pm.Gamma.dist(alpha, beta)
    
    def get_alpha_beta(self, alpha=None, beta=None, mu=None, sd=None):
        if (alpha is not None) and (beta is not None):
            pass
        elif (mu is not None) and (sd is not None):
            alpha = mu ** 2 / sd ** 2
            beta = mu / sd ** 2
        else:
            raise ValueError('Incompatible parameterization. Either use '
                             'alpha and beta, or mu and sd to specify '
                             'distribution.')

        return alpha, beta
    
    def random(self, point=None, size=None, repeat=None):
        alpha, beta, pi = distribution.draw_values(
            [self.alpha, self.beta, self.pi], point=point
        )
        
        gamma = distribution.generate_samples(
            stats.gamma.rvs, alpha, scale=1. / beta, dist_shape=self.shape, size=size
        )
        
        # Zero out each Gamma draw with probability pi
        sampled = gamma * (np.random.random(np.squeeze(gamma.shape)) < (1 - pi))
        
        return distribution.reshape_sampled(sampled, size, self.shape)
    
    def logp(self, value):
        # Mixture log-density: log(pi) at zero, log(1 - pi) + Gamma logp otherwise
        return tt.switch(
            tt.gt(value, 0), 
            tt.log1p(-self.pi) + self.gamma.logp(value), 
            tt.log(self.pi)
        )
with pm.Model() as model:        
    # Network-level parameters
    level_1_param_0_0 = pm.HalfNormal(
        'level_1_param_0_0', 
        sd=100,
        #sd=level_0_param_0_0[hierarchy['level_1']['data'].parent.values],
        shape=hierarchy['level_1']['data'].child.nunique(),
        testval=1
    )
    level_1_param_1_0 = pm.HalfNormal(
        'level_1_param_1_0',
        sd=100,
        #sd=level_0_param_1_0[hierarchy['level_1']['data'].parent.values],
        shape=hierarchy['level_1']['data'].child.nunique(),
        testval=1
    )

    # Priors for the logit-scale probability that an observation is zero
    level_1_param_2_0 = pm.Normal(
        'level_1_param_2_0', 
        mu=0, 
        sd=25, 
        shape=hierarchy['level_1']['data'].child.nunique(),
        testval=1
    )   
    level_1_param_2_1 = pm.HalfNormal(
        'level_1_param_2_1', 
        5, 
        shape=hierarchy['level_1']['data'].child.nunique(),
        testval=1
    )   
      
    # Priors for the program-level (network_program_id) Gamma parameters
    alpha = pm.HalfNormal(
        'alpha', 
        sd=level_1_param_0_0[hierarchy['level_2']['data'].parent.values],
        shape=hierarchy['level_2']['data'].child.nunique()
    )
    beta = pm.HalfNormal(
        'beta', 
        sd=level_1_param_1_0[hierarchy['level_2']['data'].parent.values],
        shape=hierarchy['level_2']['data'].child.nunique()
    )
    mu_pi_offset = pm.Normal(
        'mu_pi_offset', 
        mu=0, 
        sd=1, 
        shape=hierarchy['level_2']['data'].child.nunique()
    )
    mu_pi = pm.Deterministic(
        'mu_pi', 
        (level_1_param_2_0[hierarchy['level_2']['data'].parent.values] + 
         mu_pi_offset * level_1_param_2_1[hierarchy['level_2']['data'].parent.values])
    )  
    pi = pm.Deterministic('pi', pm.math.sigmoid(mu_pi))

    # Likelihood function
    y = ZeroInflatedGamma(
        'y', 
        alpha=alpha[hierarchy['level_3']['data'].parent.values], 
        beta=beta[hierarchy['level_3']['data'].parent.values], 
        pi=pi[hierarchy['level_3']['data'].parent.values],
        observed=observed.y
    )
    
with model:
    trace = pm.sample(1500, tune=500, progressbar=True, chains=1, njobs=4, nuts_kwargs={'target_accept':0.85})

The problem is likely that when you parameterize with mu and sd, sometimes the values fall outside the support. For example, you can see a similar problem and discussion here: Beta Regression in pyMC3 - #6 by bdyetton
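
To illustrate with rough numbers (a standalone sketch in numpy/scipy; the values are made up): if a prior draw happens to give sd much larger than mu, the implied alpha and beta collapse toward zero and the Gamma log-density becomes extreme, which is exactly the kind of value that produces NaN/inf during optimization or “Bad initial energy” during sampling.

import numpy as np
from scipy import stats

mu, sd = 0.01, 100.0            # hypothetical draws with sd >> mu
alpha = mu ** 2 / sd ** 2       # = 1e-08
beta = mu / sd ** 2             # = 1e-06

# With alpha << 1 the density has an infinite spike at 0, so the logp is
# astronomically large near 0 and very small everywhere else:
print(stats.gamma.logpdf(1e-300, alpha, scale=1.0 / beta))  # ~ +672
print(stats.gamma.logpdf(50.0, alpha, scale=1.0 / beta))    # ~ -22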

Here you can find more information and potential solutions to the “NaN occurred in optimization” and “Bad initial energy” problems.

Hi junpenglao,

Thank you for your advice. So it seems the issue is that when the standard deviation draws are much greater than the mean, the resulting alpha/beta parameters push the Gamma distribution to the edge of its (0, infinity) support?

Is there anything I can do to parameterize the Gamma distribution in terms of alpha and beta, but have those parameters be a deterministic transform of mu and sd, so that the draws stay within the Gamma’s support?

Also, how can I diagnose the “NaN occurred in optimization” error with ADVI?

Yes, you can create a pm.Deterministic to save mu and sd:

mu = pm.Deterministic('mu', alpha / beta)
sd = pm.Deterministic('sd', tt.sqrt(alpha) / beta)  # the variance is alpha / beta**2
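
Going the other way (which I think is what you are after), you can put the priors on mu and sd and derive alpha and beta deterministically, using the same moment matching as in your get_alpha_beta. A sketch with made-up priors:

with pm.Model():
    mu = pm.HalfNormal('mu', sd=100)    # hypothetical priors on mean and sd
    sd = pm.HalfNormal('sd', sd=100)
    alpha = pm.Deterministic('alpha', mu ** 2 / sd ** 2)
    beta = pm.Deterministic('beta', mu / sd ** 2)

Since mu and sd are positive, alpha and beta always land in the Gamma’s parameter space, although extreme mu/sd ratios can still be numerically fragile.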

You can monitor the parameters during the approximation (http://docs.pymc.io/notebooks/variational_api_quickstart.html#Tracking-parameters) and check where the problem appears.
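
For example, following that notebook (a minimal sketch; assumes the model above is in scope):

with model:
    advi = pm.ADVI()
    # record the approximation's mean and std at every iteration
    tracker = pm.callbacks.Tracker(
        mean=advi.approx.mean.eval,
        std=advi.approx.std.eval
    )
    approx = advi.fit(20000, callbacks=[tracker])

# tracker['mean'] and tracker['std'] hold the per-iteration history;
# plotting them shows which parameters blow up before the NaN.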

One additional, somewhat related question.

My observed values are actually aggregated data coming from a sample. For each observed value, I know the sample size used to derive it.

I want to use that sample-size information to account for the uncertainty in each observed value. My initial thought was to assume each observed value comes from a Normal distribution whose mean is the observed value and whose standard deviation depends on the sample size.

I would then like to feed this into the Gamma likelihood, but it’s not clear how to do so, since the values from that Normal distribution are no longer observed.
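
In rough code, the kind of structure I have in mind is something like this (a sketch only; the names and numbers are made up, and I’m not sure this is the right way to wire it up, which is why I’m asking):

import numpy as np
import pymc3 as pm

# made-up aggregated data: observed sample means and their sample sizes
y_bar = np.array([12.0, 30.5, 47.0])
n = np.array([25, 100, 10])

with pm.Model():
    alpha = pm.HalfNormal('alpha', sd=10, testval=1)
    beta = pm.HalfNormal('beta', sd=10, testval=1)
    sigma = pm.HalfNormal('sigma', sd=10)   # per-observation noise scale

    # latent "true" values, Gamma distributed as in the model above
    latent_y = pm.Gamma('latent_y', alpha=alpha, beta=beta, shape=len(y_bar))

    # each observed mean is the latent value plus noise shrinking with n
    pm.Normal('y_obs', mu=latent_y, sd=sigma / np.sqrt(n), observed=y_bar)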

So my question is: how can I include the uncertainty related to the sample size?