NaN occurred in optimization (error in DL with variational minibatches)

I have the following Deep Learning model:

import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm

X = theano.shared(X_train)
Y = theano.shared(np.where(y_train==1)[1])
X_off = theano.shared(X_offset_train)

# layer sizes: input features, two hidden layers of 10, and a single output
h = [n_features, 10, 10, 1]
inits = []
for i in range(len(h)-1):
    # Glorot-style initial values for the weight matrices
    inits.append(np.random.randn(h[i], h[i+1])*
                 np.sqrt(2/(h[i] + h[i+1])))
    
with pm.Model() as dl_model:
    ws = []
    logit = X
    for i in range(len(h)-1):
        ws.append(pm.Normal('w{0}'.format(i), 0, sd=1, 
                            shape=(h[i], h[i+1]), 
                            testval=inits[i]))
        # repeat the weight matrix max_horse times along the block diagonal
        w_repeat = pm.math.block_diagonal([ws[-1]]*max_horse)
        logit = tt.nnet.relu(tt.dot(logit, w_repeat))

    p = tt.nnet.softmax(logit + X_off)
    out = pm.Categorical('out', p, observed=Y)

During the posterior inference process I get a “FloatingPointError: NaN occurred in optimization” error. The optimization is invoked via the following code block.

batch_size = 256
X_m = pm.Minibatch(X_train, batch_size)
X_offset_m = pm.Minibatch(X_offset_train, batch_size)
Y_m = pm.Minibatch(np.where(y_train==1)[1], batch_size)
with dl_model:
    approx = pm.fit(100000,
                    more_replacements={X:X_m, X_off: X_offset_m, Y:Y_m},
                    callbacks=[pm.callbacks.CheckParametersConvergence(tolerance=1e-4)])

The question is: how can I go about debugging which values caused this error? Am I somehow able to run the last minibatch through the model so that I can get the log-likelihood via the out variable? If not, any tips on what may have gone wrong would be appreciated. None of my training data contain NaNs.

Edit: I should have mentioned that this occurs at random iteration numbers. The last time I ran it, it got to iteration 28091/100000 before it stopped.

Edit 2: It turns out the solution in this case was to not have an activation on the last layer, so I had to have this inside the layer loop of the model:

logit = tt.dot(logit, w_repeat) + b_repeat
# apply the nonlinearity only on the hidden layers; the output layer stays linear
if i < len(h) - 2:
    logit = tt.tanh(logit)

Have you tried the debugging methods in this post?

I changed the question to mention that this didn't happen in the very first iteration. I did get it running by using BoundedNormal = pm.Bound(pm.Normal, lower=-2.0, upper=2.0) instead of pm.Normal on the weights. However, when I widened the limits to +/- 3 it again halted the inference process due to a NaN. And even though it did complete the optimization with the tighter bounds, the accuracy of the model was worse than a linear model.
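
For reference, the swap was essentially this (just a sketch of the weight prior; the rest of the layer loop stays as in the model above):

BoundedNormal = pm.Bound(pm.Normal, lower=-2.0, upper=2.0)

# inside the layer loop, in place of the pm.Normal weight prior:
ws.append(BoundedNormal('w{0}'.format(i), mu=0, sd=1,
                        shape=(h[i], h[i+1]),
                        testval=inits[i]))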

Is there a way to check the offending sample? This might allow me to remove the Bounds somehow.

OK, this is embarrassing. It turns out it was stuffing up because I had accidentally put a relu on the last layer as well, instead of leaving it linear. I didn't realise that having an activation in the output layer would cause so much damage, considering I'm still taking the softmax. Regardless, for future issues, is there a way to check an offending sample which caused the NaN or Inf values in optimization?

What do you mean by offending sample?

I'm assuming that when we were sampling from the posterior, one of the samples set one of the weights to a ridiculously large value, say 10000000, which somehow caused the log probability to be infinite. A similar argument applies for small values: they produce near-zero probabilities through the softmax, which then cause NaNs in the log-likelihood.
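
As a rough illustration of what I mean (plain numpy, not the actual PyMC3 graph): extreme logits push the softmax to exact zeros in floating point, and the log-likelihood then blows up.

import numpy as np

# hypothetical logits produced by an absurdly large weight value
logits = np.array([1e4, 0.0, 0.0])

# numerically stable softmax; the small entries underflow to exactly 0.0
p = np.exp(logits - logits.max())
p /= p.sum()
print(p)          # [1. 0. 0.]

# the log-likelihood of observing class 1 or 2 is log(0) = -inf,
# and anything downstream (gradients, ELBO) turns into NaN
print(np.log(p))  # [  0. -inf -inf]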

I understand that VI is different in that we are optimising the mu/sigma of the approximate posterior q(w), so maybe we ended up with a large mu value?

I see. Yes, you can track the mu/sigma values, see http://docs.pymc.io/notebooks/variational_api_quickstart.html#Tracking-parameters
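
Roughly along these lines (a sketch adapted from that quickstart to your model; the Tracker callback records the approximation's mean and std at every iteration, so you can see whether they blow up just before the NaN):

with dl_model:
    advi = pm.ADVI()
    # record the means and stds of the approximate posterior at every iteration
    tracker = pm.callbacks.Tracker(
        mean=advi.approx.mean.eval,  # callable returning the current means
        std=advi.approx.std.eval,    # callable returning the current stds
    )
    approx = advi.fit(100000,
                      more_replacements={X: X_m, X_off: X_offset_m, Y: Y_m},
                      callbacks=[tracker])

# tracker['mean'] and tracker['std'] hold one array per iteration,
# so you can inspect the last few entries if/when the fit dies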