I’m looking at the framework for variational inference, because I would like to evaluate the “Noisy ADAM” variant described here:
I roughly follow the framework: an operator (in this case KL), plus an approximation (in this case MeanField), plus a step function (in this case ADAM).
Looking at the implementation, I can’t quite see how the momentum actually works:
```python
t = t_prev + 1
a_t = learning_rate / (one - beta1**t)
for param, g_t in zip(params, all_grads):
    value = param.get_value(borrow=True)
    m_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                           broadcastable=param.broadcastable)
    u_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                           broadcastable=param.broadcastable)
    m_t = beta1 * m_prev + (one - beta1) * g_t
    u_t = tt.maximum(beta2 * u_prev, abs(g_t))
    step = a_t * m_t / (u_t + epsilon)
    updates[m_prev] = m_t
    updates[u_prev] = u_t
    updates[param] = param - step
updates[t_prev] = t
```
It seems to me that `u_prev` (and `m_prev`) are, at every iteration, equal to a zero matrix, effectively dropping the momentum component.
[Edit: Okay, it made sense literally one second after posting. This code just builds up the symbolic updates; the zero matrix is only the *initial* value of the shared state, and `updates[m_prev] = m_t` ensures that `m_prev` is overwritten (in the correct order) on every call to the compiled step function.]
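To convince myself, here is a minimal NumPy re-enactment of the update semantics (a hypothetical sketch, not the library's code): the zero arrays correspond to the initial values of the `theano.shared` variables, and each call carries the momentum state forward exactly as the `updates` dictionary would. Note the `max`-based second moment makes this the Adamax-style variant of ADAM.

```python
import numpy as np

def adamax_step(param, grad, m_prev, u_prev, t_prev,
                learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adamax-style step; mirrors the symbolic updates above."""
    t = t_prev + 1
    a_t = learning_rate / (1.0 - beta1 ** t)        # bias-corrected step size
    m_t = beta1 * m_prev + (1.0 - beta1) * grad     # first moment (momentum)
    u_t = np.maximum(beta2 * u_prev, np.abs(grad))  # infinity-norm second moment
    step = a_t * m_t / (u_t + epsilon)
    return param - step, m_t, u_t, t

param = np.zeros(3)
m = np.zeros(3)   # initial value, like theano.shared(np.zeros(...))
u = np.zeros(3)
t = 0
g = np.ones(3)    # constant gradient, just for illustration
for _ in range(3):
    param, m, u, t = adamax_step(param, g, m, u, t)
# After three steps m = 1 - beta1**3 = 0.271, not zero:
# the momentum state persists between calls.
```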
An additional question: the approach here seems to be to generate a new value for each (Theano) node via `symbolic_sample_over_posterior`. For an `Approximation`, will these be samples over the variational distribution? Is there a way to make these available to the update, rather than the (approximate) gradients alone? If not, then it looks like VOGN (from https://arxiv.org/abs/1906.02506) is just ADAM with slightly different weights.
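For concreteness, here is what I would expect "samples over the variational distribution" to mean for a mean-field Gaussian approximation (a hypothetical sketch; `sample_posterior` and the parameter names are my own, not the library's API): each draw is reparameterised as `mu + sigma * eps`, so gradients with respect to `(mu, log_sigma)` can flow through the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean-field parameters for a two-parameter model.
mu = np.array([0.5, -1.0])
log_sigma = np.array([-1.0, -2.0])

def sample_posterior(n_draws):
    """Reparameterised draws from the mean-field Gaussian approximation."""
    eps = rng.standard_normal((n_draws, mu.shape[0]))
    return mu + np.exp(log_sigma) * eps   # shape (n_draws, n_params)

draws = sample_posterior(10_000)
# The empirical mean and std of the draws should match mu and exp(log_sigma).
```

An update rule that could see `draws` directly (rather than only the averaged gradient) is what would distinguish a VOGN-style step from plain ADAM.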