I’m looking at the framework for variational inference, because I would like to evaluate the “Noisy ADAM” variant described here:
I roughly understand the framework: an operator (in this case KL), plus an approximation (in this case MeanField), plus a step function (in this case ADAM).
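For reference, here is roughly how I understand those three pieces composing in practice. This is only a sketch of my reading of the PyMC3 API, so the exact names (pm.MeanField, pm.KLqp, pm.adamax, obj_optimizer) are my assumptions and may not match the version in question; I use pm.adamax because the snippet below appears to be the adamax rule, but pm.adam should slot in the same way:

import numpy as np
import pymc3 as pm

# toy data and model
data = np.random.randn(100)

with pm.Model():
    mu = pm.Normal("mu", 0.0, 10.0)
    pm.Normal("obs", mu, 1.0, observed=data)

    # approximation + operator + step function, composed explicitly
    approx = pm.MeanField()        # the approximation
    inference = pm.KLqp(approx)    # the KL operator, wrapped as an Inference
    inference.fit(10000, obj_optimizer=pm.adamax(learning_rate=0.01))  # the step function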
Looking at the implementation, I can’t quite see how the momentum actually works:
t = t_prev + 1
a_t = learning_rate / (one - beta1**t)

for param, g_t in zip(params, all_grads):
    value = param.get_value(borrow=True)
    # shared variables holding the running moment estimates, initialised to zero
    m_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                           broadcastable=param.broadcastable)
    u_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                           broadcastable=param.broadcastable)

    m_t = beta1 * m_prev + (one - beta1) * g_t   # first moment (the momentum term)
    u_t = tt.maximum(beta2 * u_prev, abs(g_t))   # infinity-norm second moment
    step = a_t * m_t / (u_t + epsilon)

    # symbolic updates: applied on every call of the compiled step function
    updates[m_prev] = m_t
    updates[u_prev] = u_t
    updates[param] = param - step

updates[t_prev] = t
it seems to me that m_prev and u_prev are, on every iteration, always equal to a zero matrix, effectively dropping the momentum component.
[Edit: Okay, it made sense literally one second after posting. This code only builds up the symbolic updates once; the zero matrix is just the initial value of the shared moment accumulators m_prev and u_prev. The loop even adds an update that sets m_prev to m_t on each call, and Theano ensures the updates happen in the correct order.]
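To double-check that understanding, here is a minimal, self-contained Theano sketch (not from the library) of the same pattern: the zeros are only the initial value of the shared accumulator, and each call of the compiled function applies the symbolic update, so the state carries over between steps:

import numpy as np
import theano
import theano.tensor as tt

g = tt.dvector("g")
m_prev = theano.shared(np.zeros(3), name="m_prev")  # starts at zero, but persists
beta1 = 0.9

m_t = beta1 * m_prev + (1 - beta1) * g   # symbolic expression, built once
step = theano.function([g], m_t, updates={m_prev: m_t})

print(step(np.ones(3)))   # first call:  m = 0.1  (momentum starts from zeros)
print(step(np.ones(3)))   # second call: m = 0.19 (previous m is remembered)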
One additional question: the approach here seems to be to generate a new value for each (Theano) node via symbolic_sample_over_posterior. For an Approximation, will these be samples from the variational distribution? Is there a way to make them available to the update rule, rather than just the (approximate) gradients? If not, then it looks like VOGN (from https://arxiv.org/abs/1906.02506) would just reduce to ADAM with slightly different weights.
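For context on why I'm asking: as I read the VOGN paper, its preconditioner averages the squares of per-example gradients (a Gauss-Newton-style estimate), whereas an ADAM-style rule squares the already-averaged minibatch gradient, and the two only coincide when every example agrees. A toy, hypothetical illustration in plain NumPy (not library code):

import numpy as np

# rows = examples in a minibatch, columns = parameters
per_example_grads = np.array([[ 1.0, -2.0],
                              [-1.0,  2.0],
                              [ 3.0,  0.0]])

adam_like = np.mean(per_example_grads, axis=0) ** 2   # square of the mean gradient
vogn_like = np.mean(per_example_grads ** 2, axis=0)   # mean of the squared gradients

print(adam_like)  # [1. 0.]
print(vogn_like)  # approx. [3.67 2.67]

So if the update rule only ever sees the averaged (approximate) gradient, the second quantity can't be computed, which is what makes access to the per-sample values relevant here.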