VI fitting of custom pdf sporadically producing 'nan' parameter values which terminate fit

jstanley · February 25, 2021, 9:18pm

Context

I am attempting to fit a custom probability distribution using the VI methods. I am automating the fitting (and recording of results) so that I can apply it to ~100k separate data sets.

Data

The data points in each data set take on integer values only.

Model

The custom PDF is proportional to the product of a CDF function (specifically that of the EMG – _emg_cdf() below) and the survival function (i.e. 1 - CDF) of the normal distribution (_norm_sf() below). Because this product cannot be analytically normalized, I am numerically computing the normalization factor at each iteration (using a large integration range) in order to scale the CDF * SF product to unity. Here’s an example of the model PDF (blue) and an example of a corresponding dataset (orange):
example

Code

Both the _emg_cdf and _norm_sf functions are written using base theano functions. Furthermore, the numeric normalization method is also expressed in terms of theano functions (see _min, _max, and _x). The logp of the pdf is returned and is converted to a pymc3 variable using the pm.DensityDist method. See below for the model code (Note: I’m using the T.printing.Print calls to track the parameter values for each fitting iteration):

import numpy as np
import theano.tensor as T
import pymc3 as pm

# Build model
with pm.Model() as m:
    m0p=pm.Uniform('mL', lower=-200, upper=200)
    s0p=pm.Uniform('sL', lower=20, upper=2000)
    t0p=pm.Uniform('tI', lower=20, upper=2000)
    m1p=pm.Uniform('mT', lower=4500, upper=10500)
    s1p=pm.Uniform('sT', lower=30, upper=2000)
    
    m0 = T.printing.Print('m0')(m0p)
    s0 = T.printing.Print('s0')(s0p)
    t0 = T.printing.Print('t0')(t0p)
    m1 = T.printing.Print('m1')(m1p)
    s1 = T.printing.Print('s1')(s1p)

    # Custom distribution ========================================================
    def _emg_cdf(x, mu, sigma, tau):
        # z = (x - mu) / sigma
        def _norm_cdf(z):
            return 0.5 * (1 + T.erf(z / T.sqrt(2.0)))

        z = (x - mu) / sigma
        invK = sigma / tau

        exparg = invK * (0.5 * invK - z)
        logprod = exparg + T.log(_norm_cdf(z - invK))
        cdf =  T.abs_(_norm_cdf(z) - T.exp(logprod))
        return cdf


    def _norm_sf(x, mu, sigma):
        arg = (x - mu) / (sigma * T.sqrt(2.0))
        return 0.5 * T.erfc(arg)


    def custom_logp(x):
        # Unnormalized distribution value
        _prob_unscaled = (
            _emg_cdf(x, mu=m0, sigma=s0, tau=t0) *_norm_sf(x, mu=m1,sigma=s1)
        )

        # Compute norm factor by integrating over entire distribution
        _n = 10 #number of stdevs for numerical normalization
        _min = T.floor(T.min([m0 - _n*s0, m1 - _n*s1]))
        _max = T.ceil(T.max([m0 + _n*np.sqrt(s0**2 + t0**2), m1 + _n*s1]))
        _x = T.arange(_min, _max, dtype='int64')

        # Generate all values, to be summed for normalization factor
        _norm_array = (
            _emg_cdf(_x, mu=m0, sigma=s0, tau=t0) * _norm_sf(_x, mu=m1, sigma=s1)
        )
        _norm_factor = T.sum(_norm_array)

        # Scale distribution to true pdf
        custom_prob = _prob_unscaled / _norm_factor

        # Compute log-likelihood
        log_pdf = T.log(custom_prob)

        return log_pdf
    #=============================================================================

    # Convert Theano log-prob func into observed pymc3 distribution variable
    custom_pdf = pm.DensityDist('custom_pdf', custom_logp, observed=rvs)

Fitting Error

I’m fitting using the ADVI class. This process seems highly unstable and inconsistent—if I tweak the prior parameters a bit or change the random_seed value, the fitting process will sometimes complete without error, however much of the time the fit is halted on an ValueError due to some of the parameters taking on a nan value. Running following fitting code (after the above model code):

with m:
    vi = pm.ADVI(random_seed=42)
    approx = vi.fit(10000, progressbar=False)

Errors with the following:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\Miniconda3\envs\dowell\lib\site-packages\theano\compile\function_module.py in __call__(self, *args, **kwargs)
    902             outputs =\
--> 903                 self.fn() if output_subset is None else\
    904                 self.fn(output_subset=output_subset)

~\Miniconda3\envs\dowell\lib\site-packages\theano\gof\op.py in rval(p, i, o, n)
    891             def rval(p=p, i=node_input_storage, o=node_output_storage, n=node):
--> 892                 r = p(n, [x[0] for x in i], o)
    893                 for o in node.outputs:

~\Miniconda3\envs\dowell\lib\site-packages\theano\tensor\basic.py in perform(self, node, inp, out_)
   5542         step = step.item()
-> 5543         out[0] = np.arange(start, stop, step, dtype=self.dtype)
   5544 

ValueError: arange: cannot compute length

...[Ive omitted some of the error]...

ValueError: arange: cannot compute length
Apply node that caused the error: ARange{dtype='int64'}(Elemwise{Floor}[(0, 0)].0, Elemwise{Ceil}[(0, 0)].0, TensorConstant{1})
Toposort index: 99
Inputs types: [TensorType(float64, scalar), TensorType(float64, scalar), TensorType(int8, scalar)]
Inputs shapes: [(), (), ()]
Inputs strides: [(), (), ()]
Inputs values: [array(nan), array(nan), array(1, dtype=int8)]
Outputs clients: [[Elemwise{sub,no_inplace}(ARange{dtype='int64'}.0, InplaceDimShuffle{x}.0), Elemwise{sub,no_inplace}(ARange{dtype='int64'}.0, InplaceDimShuffle{x}.0)]]

Printing the parameter values for the last few iterations of the fit (using T.printing.Print) I see the nan values that produce the error:

...
m0 __str__ = -106.98177137073989
s0 __str__ = 564.1090064087261
t0 __str__ = 283.4875869828139
m1 __str__ = 5563.1240925511665
s1 __str__ = 485.17358772825014
m0 __str__ = 26.688642526520788
s0 __str__ = 153.88894549775893
t0 __str__ = 419.6344762064323
m1 __str__ = 6306.569380878885
s1 __str__ = 1364.1750774031318
m0 __str__ = nan
s0 __str__ = nan
t0 __str__ = nan
m1 __str__ = 5909.70120440209
s1 __str__ = 612.2725062193786
[ERRORS HERE]

This doesn’t appear to be a problem with my model. If I evaluate the model using the parameter values just prior to the nan iteration, there doesn’t appear to be an issues with the PDF. Furthermore, if I adjust the parameters for the prior distributions, sometimes the fitting procedure completes without error and produces a sensible result. I am wondering if there is some issue with the theano backend that is causing this. This wouldn’t be a big issue if I were simply running the fit on a single dataset, but I am attempting to apply this fitting procedure to 100k+ datasets and ~40% of them are failing with this same issue, making it difficult to automate the fitting. Any help would be much appreciated. Thanks!

One note: I can prodvide code to simulate the above example data, if needed.

Versions

import sys, scipy, numpy, pymc3
print(scipy.__version__, numpy.__version__, pymc3.__version__, sys.version_info)

1.6.0 1.18.1 3.8 sys.version_info(major=3, minor=7, micro=6, releaselevel='final', serial=0)

ckrapu · February 26, 2021, 7:15pm

Without knowing too much about your dataset, I wonder if there could be overflow/underflow issues. Would you mind trying to repeat your results with bounded gradients for ADVI? the fit method used for fitting a variational approximation can take the argument total_grad_norm_constraint which bounds the gradient. You may also want to play around with the other keyword arguments for that function - maybe taking additional Monte Carlo samples for the gradient estimator could help. Also, are you running your computations in 64 bit (default) or 32 bit?

jstanley · February 26, 2021, 7:45pm

Thank you for the suggestions @ckrapu. For simplicity sake the data above was simulated. The following code should reproduce the dataset above:

from scipy import stats

def custom_pdf(x, m0, s0, t0, m1, s1):

    cdf = stats.exponnorm.cdf(x, t0/s0, m0, s0)
    sf = stats.norm.sf(x, m1, s1)
    unscaled = np.nan_to_num(cdf * sf)

    xmin = int(min(m0 - 10*s0, m1 - 10*s1))
    xmax = int(max(m0 + 10*np.sqrt(s0**2 + t0**2), m1 + 10*s1))
    xfull = np.array(range(xmin, xmax))
    
    cdf = stats.exponnorm.cdf(xfull, t0/s0, m0, s0)
    sf = stats.norm.sf(xfull, m1, s1)
    norm_factor = sum(np.nan_to_num(cdf * sf))

    pdf = np.nan_to_num(unscaled/norm_factor)
    return pdf


def custom_rvs(m0, s0, t0, m1, s1, size=5000, seed=42):

    xmin = int(min(m0 - 10*s0, m1 - 10*s1))
    xmax = int(max(m0 + 10*np.sqrt(s0**2 + t0**2), m1 + 10*s1))
    xfull = np.array(range(xmin, xmax))

    pdf = custom_pdf(xfull, m0=m0, s0=s0, t0=t0, m1=m1, s1=s1)

    # Adjust pdf to sum to 1.0 (residue from finite normalization integration)
    residue = 1.0 - sum(pdf)
    pdf[-1] = pdf[-1] + abs(residue)
        
    np.random.seed(seed=seed)
    samples = np.random.choice(xfull, size=size, replace=True, p=pdf)
    samples = np.sort(samples)
    return samples

x = np.array(range(-1000, 7000))
m0 = 0
s0 = 200
t0 = 200
m1 = 6000
s1 = 300
N = 5000
rvs = custom_rvs(m0, s0, t0, m1, s1)

The rvs array is that which I assigned to the observed= kwarg in the DensityDist call in my previous post.

I have tried to limit the gradient with the total_grad_norm_constraint but that didn’t solve it. I’ll try additional samples for the gradient estimate. Hopefully that helps. I’m running in 64 bit.

jstanley · February 26, 2021, 8:25pm

@ckrapu to answer your question, I ran the fit both with a constraint on the gradient and increased number of MC samples to estimate the gradient, as follows:

with m:
    vi = pm.ADVI(random_seed=42)
    
    approx = vi.fit(10000, progressbar=False, total_grad_norm_constraint=0.00001, obj_n_mc=10)

Which produced the same error (nan parameter values), at approximately the same iteration. If there is an over/underflow issue, do you think it would be in the gradient estimate or in model evaluation or somewhere else? If it were in the gradient estimate, presumably the total_grad_norm_constraint would prevent overflow, but is it possible that the gradient would underflow and assign nan parameter values?

ricardoV94 · February 26, 2021, 10:40pm

We recently detected some issues in how Theano optimized the expression of the normal cdf when written as you wrote: ERROR (theano.graph.opt): Optimization failure due to: local_grad_log_erfc_neg · Issue #291 · pymc-devs/aesara · GitHub

You may want to try using the one in pm.distributions.dist_math.std_cdf just to rule that out.

jstanley · February 27, 2021, 11:24pm

@ricardoV94 good to know that there’s efforts to address the normal cdf stability. Thanks for the suggestion. However, I tried to implement it in my model and it actually seems to exacerbate the problem (though perhaps I’m implementing it improperly). Can the pm.distributions.dist_math.std_cdf act as a direct replacement for the theano tensor object calculated with the T.erf function? I changed my code as follows (see the _norm_cdf and _norm_sf functions):

# Build model
with pm.Model() as m:
    m0p=pm.Uniform('mL', lower=-200, upper=200)
    s0p=pm.Uniform('sL', lower=20, upper=2000)
    t0p=pm.Uniform('tI', lower=20, upper=2000)
    m1p=pm.Uniform('mT', lower=4500, upper=10500)
    s1p=pm.Uniform('sT', lower=30, upper=2000)
    
    m0 = T.printing.Print('m0')(m0p)
    s0 = T.printing.Print('s0')(s0p)
    t0 = T.printing.Print('t0')(t0p)
    m1 = T.printing.Print('m1')(m1p)
    s1 = T.printing.Print('s1')(s1p)

    # Custom distribution ========================================================
    def _emg_cdf(x, mu, sigma, tau):
        # z = (x - mu) / sigma
        def _norm_cdf(z):
            # replacement norm cdf
            return pm.distributions.dist_math.std_cdf(z)

        z = (x - mu) / sigma
        invK = sigma / tau

        exparg = invK * (0.5 * invK - z)
        logprod = exparg + T.log(_norm_cdf(z - invK))
        cdf =  T.abs_(_norm_cdf(z) - T.exp(logprod))
        return cdf


    def _norm_sf(x, mu, sigma):
        # replacement survival function (1 - CDF)
        return 1.0 - pm.distributions.dist_math.std_cdf((x - mu) / sigma)


    def custom_logp(x):
        # Unnormalized distribution value
        _prob_unscaled = (
            _emg_cdf(x, mu=m0, sigma=s0, tau=t0) *_norm_sf(x, mu=m1,sigma=s1)
        )

        # Compute norm factor by integrating over entire distribution
        _n = 10 #number of stdevs for numerical normalization
        _min = T.floor(T.min([m0 - _n*s0, m1 - _n*s1]))
        _max = T.ceil(T.max([m0 + _n*np.sqrt(s0**2 + t0**2), m1 + _n*s1]))
        _x = T.arange(_min, _max, dtype='int64')

        # Generate all values, to be summed for normalization factor
        _norm_array = (
            _emg_cdf(_x, mu=m0, sigma=s0, tau=t0) * _norm_sf(_x, mu=m1, sigma=s1)
        )
        _norm_factor = T.sum(_norm_array)

        # Scale distribution to true pdf
        custom_prob = _prob_unscaled / _norm_factor

        # Compute log-likelihood
        log_pdf = T.log(custom_prob)

        return log_pdf
    #=============================================================================

    # Convert Theano log-prob func into observed pymc3 distribution variable
    custom_pdf = pm.DensityDist('custom_pdf', custom_logp, observed=rvs)

Because my custom pdf relies on the exponentially modified guassian CDF and the normal SF, I used the pm.distributions.dist_math.std_cdf function you suggested in both places.

Fitting in the same manner as before:

with m:
    vi = pm.ADVI(random_seed=42)
    approx = vi.fit(10000, progressbar=True)

Which errors due to nan parameter values after the first iteration:


 0.01% [1/10000 00:00<03:01 Average Loss = 44,724]

m0 __str__ = -33.05241765693242
s0 __str__ = 472.1671991666316
t0 __str__ = 1224.923195156382
m1 __str__ = 6601.104988840433
s1 __str__ = 1621.0981016921528
m0 __str__ = nan
s0 __str__ = nan
t0 __str__ = nan
m1 __str__ = 8449.086492394166
s1 __str__ = 413.52418448877586

---------------------------------------------------------------------------
ValueError: arange: cannot compute length
Apply node that caused the error: ARange{dtype='int64'}(Elemwise{Floor}[(0, 0)].0, Elemwise{Ceil}[(0, 0)].0, TensorConstant{1})
Toposort index: 101
Inputs types: [TensorType(float64, scalar), TensorType(float64, scalar), TensorType(int8, scalar)]
Inputs shapes: [(), (), ()]
Inputs strides: [(), (), ()]
Inputs values: [array(nan), array(nan), array(1, dtype=int8)]
Outputs clients: [[Elemwise{sub,no_inplace}(ARange{dtype='int64'}.0, InplaceDimShuffle{x}.0), Elemwise{sub,no_inplace}(ARange{dtype='int64'}.0, InplaceDimShuffle{x}.0)]]

Is it possible that my EMG CDF implementation is causing problems? Do you know if there’s an improved implementation of that function? I’m aware that it’s notoriously difficult to represent (see: BUG: stats: Fix exponnorm.cdf and exponnorm.sf for small K by WarrenWeckesser · Pull Request #13430 · scipy/scipy · GitHub).

ricardoV94 · February 28, 2021, 9:13am

I think you are using CDF normly. The problem is that it still underflows easily.

Any chances you can work in log scale all the time? Here was a recent post where working on the log scale helped avoid similar type of issues: Replacing nested for loops from JAGS code for cognitive model of temporal discounting - #3 by ricardoV94

jstanley · March 1, 2021, 9:01pm

@ricardoV94 thank you for the suggestion. However, expressing all the terms of the probability calculation in log space doesn’t seem to fix the underflow problem. Let me make sure I’m understanding what you’re suggesting: Since my probability calculation is given by

P(x) = \frac{CDF(x)\,\cdot\,SF(x)}{N}

(where CDF is the cdf of the ExGaussian distribution and SF is the survival function of the Normal distribution) and since my custom pdf function must return the logp, I should compute the logp values by distributing the log across the three terms in the probability distribution. Namely,

log( P(x)) = log(CDF(x)) + log(SF(x)) - log(N)

Thus, I replaced

# Unnormalized distribution value
_prob_unscaled = (
    _emg_cdf(x, mu=m0, sigma=s0, tau=t0) *_norm_sf(x, mu=m1,sigma=s1)
)

# Generate all values to be summed for normalization factor
_norm_array = (
    _emg_cdf(_x, mu=m0, sigma=s0, tau=t0) * _norm_sf(_x, mu=m1, sigma=s1)
)
_norm_factor = T.sum(_norm_array)

# Scale distribution to true pdf
custom_prob = _prob_unscaled / _norm_factor

# Compute log-likelihood
log_pdf = T.log(custom_prob)

with the following code

# Unnormalized distribution value
_logp_unscaled = (
    T.log(_emg_cdf(x, mu=m0, sigma=s0, tau=t0)) 
    +T.log(_norm_sf(x, mu=m1,sigma=s1))
)
_norm_factor = T.sum(_norm_array)

# Generate all values to be summed for normalization factor
_norm_array = (
    _emg_cdf(_x, mu=m0, sigma=s0, tau=t0) * _norm_sf(_x, mu=m1, sigma=s1)
)

# Scale distribution to true pdf
log_pdf = _logp_unscaled - T.log(_norm_factor)

Log is directly applied to the _emg_cdf, norm_sf and _norm_factor terms. However, the same error still occurs (whether or not I use my original construction for the norm cdf or the pm.distributions.dist_math.std_cdf)

Any ideas?

jstanley · March 1, 2021, 10:54pm

@ricardoV94 it seems the overflow is occurring both for the log of the ExGaussian CDF and for the log of the Normal SF. Picking some arbitrary parameter values and evaluating the functions I get the following:

m0 = 0
s0 = 500
t0 = 200
m1 = 1000
s1 = 150

def _log_emg_cdf(x, mu, sigma, tau):
    # z = (x - mu) / sigma
    def _norm_cdf(z):
        return 0.5 * (1.0 + T.erf(z / T.sqrt(2.0)))

    z = (x - mu) / sigma
    invK = sigma / tau

    exparg = invK * (0.5 * invK - z)
    logprod = exparg + T.log(_norm_cdf(z - invK))
    cdf =  T.abs_(_norm_cdf(z) - T.exp(logprod))
    return T.log(cdf)

x = np.array(range(-5000, 5000))

logcdf = _log_emg_cdf(x, m0, s0, t0).eval()
plt.figure(figsize=[12, 6])
plt.xlim(x.min(), x.max())
plt.xlabel('x')
plt.ylabel('Log of Exp. Gaussian CDF')
plt.plot(x, logcdf);

Returns:

Function values:

array([ -inf, -inf, -inf, ..., -3.20865223e-10, -3.19264837e-10, -3.17672555e-10])

As you can see the log cdf overflows at a relatively modest function value, ~-35 (as compared to the log of floating point precision, ~-700).

And for the SF (both my original form and that in pm.distributions.dist_math.std_cdf appear to be identical):

def _log_norm_sf(x, mu, sigma):
    arg = (x - mu) / (sigma * T.sqrt(2.0))
    cdf = 0.5 * T.erfc(arg)
    return T.log(cdf)

def _log_norm_sf_alt(x, mu, sigma):
    sf = 1.0 - pm.distributions.dist_math.std_cdf((x - mu) / sigma)
    return T.log(sf)

x = np.array(range(-5000, 10000))

logsf = _log_norm_sf(x, m1, s1).eval()
logsfalt = _log_norm_sf_alt(x, m1, s1).eval()

plt.figure(figsize=[12, 6])
plt.xlim(x.min(), x.max())
plt.xlabel('x')
plt.ylabel('Log of Normal SF')
plt.plot(x, logsf)
plt.plot(x, logsfalt, '--')
plt.legend(['Log SF', 'log SF (alt pymc3 version)'])
print('log SF:', logsf)
print('log SF alt:', logsfalt)

Returns:

Function values:
array([ 0., 0., 0., ..., -inf, -inf, -inf])

These both seem to overflow around the floating point limit.

Alternatively, I have tried using the logcdf subroutines of the ExGaussian and Normal pymc3 distributions:

pm.ExGaussian.dist(mu=, sigma=, nu=).logcdf(x)
pm.Normal.dist(mu=, sigma=).logcdf(x)

These seem to work, however they are extremely slow due to a lack of C implementation of the complimentary error function. Pymc3 returns the following warning:
WARNING (theano.tensor.opt): Optimization Warning: The Op erfcx does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.

Are there no better implementations of the ExGuassian CDF or Normal CDF to use in one’s model?

ricardoV94 · March 2, 2021, 8:11am

I was going to suggest you use the exgaussian cdf, but I see you already encountered it (I recently did some improvements in the stability of the logcdf method, in a more recent version of pymc: Improve ExGaussian logcdf and refactor logp by ricardoV94 · Pull Request #4407 · pymc-devs/pymc3 · GitHub). We are aware of the lack of C support for the erfcx method in theano, and we hope to have it fixed soon. Do you notice a sizeable slowdown besides the warning?

jstanley · March 2, 2021, 11:18pm

Nice work on the logcdf method. It works well. It does result in about a 10x slowdown (just roughly estimated from watching the progress bar), however it might not be a particular good comparison because my previous implementation of the cdf just wasn’t working (the one above). Once erfcx has C support, I’ll do a more appropriate comparison of speed.

To speed it up a bit, I pulled your code from the logcdf definition (here: pymc3/continuous.py at master · pymc-devs/pymc3 · GitHub) and removed the bound() function (since I’ve constrained my priors to be positive definite anyway) which helped. I look forward to the new and improved erfcx. Thanks again!

ricardoV94 · March 3, 2021, 9:34am

Actually there is a feature in pymc >= 3.11.0 that removes the bounds automatically. You use it by writing:

with pm.Model(check_bounds=False):
    # < same model as before >

This will remove all bounds in your model variables, which are unnecessary when you have a well specified model that cannot give invalid values for downstream distributions (But you have to be sure this is the case for all variables).

jstanley · March 3, 2021, 5:03pm

Ah, good to know. That’s a useful feature. Thanks.

Topic		Replies	Views
Sampling time is very long (minutes per sample) Questions	13	4611	November 2, 2017
Theano warnings when fitting a simple psychometric function (cumulative normal distribution) Questions theano	2	714	February 2, 2021
Averaging over model output that sometimes returns null values? Questions	12	1486	November 16, 2017
How to use pymc3.fit() method Questions	16	3891	May 7, 2018
NaN occurred in optimization with ADVI Questions	5	4673	August 6, 2019