NaN occurred in optimization (error in DL with variational minibatches)

I have the following Deep Learning model:

import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm

X = theano.shared(X_train)
Y = theano.shared(np.where(y_train==1)[1])
X_off = theano.shared(X_offset_train)

# layer sizes: input features, two hidden layers of 10, and a single output
h = [n_features, 10, 10, 1]
inits = []
for i in range(len(h)-1):
    # Glorot-style initial values for the weight matrices
    inits.append(np.random.randn(h[i], h[i+1])*
                 np.sqrt(2/(h[i] + h[i+1])))
    
with pm.Model() as dl_model:
    ws = []
    logit = X
    for i in range(len(h)-1):
        ws.append(pm.Normal('w{0}'.format(i), 0, sd=1, 
                            shape=(h[i], h[i+1]), 
                            testval=inits[i]))
        # repeat the weight matrix max_horse times along the block diagonal
        w_repeat = pm.math.block_diagonal([ws[-1]]*max_horse)
        logit = tt.nnet.relu(tt.dot(logit, w_repeat))

    p = tt.nnet.softmax(logit + X_off)
    out = pm.Categorical('out', p, observed=Y)

During the posterior inference process I get a “FloatingPointError: NaN occurred in optimization” error. The optimization is invoked via the following code block.

batch_size = 256
X_m = pm.Minibatch(X_train, batch_size)
X_offset_m = pm.Minibatch(X_offset_train, batch_size)
Y_m = pm.Minibatch(np.where(y_train==1)[1], batch_size)
with dl_model:
    approx = pm.fit(100000,
                    more_replacements={X:X_m, X_off: X_offset_m, Y:Y_m},
                    callbacks=[pm.callbacks.CheckParametersConvergence(tolerance=1e-4)])

The question is: how can I go about debugging which values caused this error? Am I somehow able to run the last minibatch through the model so that I can get the log-likelihood via the out variable? If not, any tips on what may have gone wrong would be appreciated. None of my training data contain NaNs.

Edit: I should have mentioned that this occurs at random iteration numbers. The last time I ran it, it got to iteration 28091/100000 before it stopped.

Edit 2: It turns out the solution in this case was to not have an activation on the last layer, so I had to have this inside the layer loop of the model:

logit = tt.dot(logit, w_repeat) + b_repeat
# apply the nonlinearity only on the hidden layers; the output layer stays linear
if i < len(h) - 2:
    logit = tt.tanh(logit)

Have you tried the debugging methods in this post?

I changed the question to mention that this didn't happen in the very first iteration. I did get it running by using BoundedNormal = pm.Bound(pm.Normal, lower=-2.0, upper=2.0) instead of pm.Normal on the weights. However, when I widened the limits to +/- 3 it again halted the inference process due to a NaN. And even though it did complete the optimization with the tighter bounds, the accuracy of the model was worse than a linear model.
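
For reference, the swap was essentially this (just a sketch of the weight prior; the rest of the layer loop stays as in the model above):

BoundedNormal = pm.Bound(pm.Normal, lower=-2.0, upper=2.0)

# inside the layer loop, in place of the pm.Normal weight prior:
ws.append(BoundedNormal('w{0}'.format(i), mu=0, sd=1,
                        shape=(h[i], h[i+1]),
                        testval=inits[i]))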

Is there a way to check the offending sample? This might allow me to remove the Bounds somehow.

OK, this is embarrassing. It turns out it was stuffing up because I had accidentally put a relu on the last layer as well, instead of leaving it linear. I didn't realise that having an activation in the output layer would cause so much damage, considering I'm still taking the softmax. Regardless, for future issues, is there a way to check an offending sample which caused the NaN or Inf values in optimization?

What do you mean by offending sample?

I'm assuming that when we were sampling from the posterior, one of the samples set one of the weights to a ridiculously large value, say 10000000, which somehow caused the log probability to be infinite. A similar argument applies for small values: they produce near-zero probabilities through the softmax, which then cause NaNs in the log-likelihood.
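
As a rough illustration of what I mean (plain numpy, not the actual PyMC3 graph): extreme logits push the softmax to exact zeros in floating point, and the log-likelihood then blows up.

import numpy as np

# hypothetical logits produced by an absurdly large weight value
logits = np.array([1e4, 0.0, 0.0])

# numerically stable softmax; the small entries underflow to exactly 0.0
p = np.exp(logits - logits.max())
p /= p.sum()
print(p)          # [1. 0. 0.]

# the log-likelihood of observing class 1 or 2 is log(0) = -inf,
# and anything downstream (gradients, ELBO) turns into NaN
print(np.log(p))  # [  0. -inf -inf]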

I understand that VI is different in that we are optimising the mu/sigma of the approximate posterior q(w), so maybe we ended up with a large mu value?

I see. Yes, you can track the mu/sigma values, see http://docs.pymc.io/notebooks/variational_api_quickstart.html#Tracking-parameters
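
Roughly along these lines (a sketch adapted from that quickstart to your model; the Tracker callback records the approximation's mean and std at every iteration, so you can see whether they blow up just before the NaN):

with dl_model:
    advi = pm.ADVI()
    # record the means and stds of the approximate posterior at every iteration
    tracker = pm.callbacks.Tracker(
        mean=advi.approx.mean.eval,  # callable returning the current means
        std=advi.approx.std.eval,    # callable returning the current stds
    )
    approx = advi.fit(100000,
                      more_replacements={X: X_m, X_off: X_offset_m, Y: Y_m},
                      callbacks=[tracker])

# tracker['mean'] and tracker['std'] hold one array per iteration,
# so you can inspect the last few entries if/when the fit dies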