How does momentum work (ADAM) in the PyMC3 framework?

I'm looking at the variational inference framework because I would like to evaluate the "Noisy ADAM" variant described here:

I kind of get the framework: an operator (in this case KL), plus an approximation (in this case MeanField), plus a step function (in this case ADAM).
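
To make sure I'm reading it correctly, this is roughly how I understand the pieces fitting together (a minimal sketch; the toy model and learning rate are placeholders, and I'm assuming a step rule like adamax can be passed in via obj_optimizer when called without loss/params):

    import numpy as np
    import pymc3 as pm
    from pymc3.variational.updates import adamax

    with pm.Model() as model:
        # placeholder model, just to have something to fit
        mu = pm.Normal('mu', 0., 1.)
        pm.Normal('obs', mu, 1., observed=np.random.randn(100))

        # ADVI = KL operator + MeanField approximation; the step function
        # (here the adamax rule quoted below) is plugged in as obj_optimizer
        approx = pm.fit(
            n=10000,
            method='advi',
            obj_optimizer=adamax(learning_rate=0.01),
        )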

Looking at the implementation, I can’t quite see how the momentum actually works:

    t = t_prev + 1
    a_t = learning_rate / (one - beta1**t)

    for param, g_t in zip(params, all_grads):
        value = param.get_value(borrow=True)
        # shared variables holding the running moment estimates,
        # created with a zero initial value
        m_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                               broadcastable=param.broadcastable)
        u_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                               broadcastable=param.broadcastable)

        # symbolic expressions for the updated moments and the parameter step
        m_t = beta1 * m_prev + (one - beta1) * g_t
        u_t = tt.maximum(beta2 * u_prev, abs(g_t))
        step = a_t * m_t / (u_t + epsilon)

        # register the symbolic updates for the compiled step function
        updates[m_prev] = m_t
        updates[u_prev] = u_t
        updates[param] = param - step

    updates[t_prev] = t

It seems to me that m_prev and u_prev are, on every iteration, always equal to a zero matrix, effectively dropping out the momentum component.

[Edit: Okay, it made sense literally one second after posting. This code only builds up the symbolic update graph; the zero matrix is just the initial value of the shared moment accumulators, not their value on every iteration. The loop also registers an update that sets m_prev to m_t (and u_prev to u_t), and Theano applies all of these updates together on each call of the compiled step function, so the state does persist across iterations.]
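
For anyone else who trips over the same thing, here is a stripped-down version of the pattern in plain Theano (names made up, nothing PyMC3-specific), showing that the shared variable only starts at zero and then carries its state across calls:

    import numpy as np
    import theano
    import theano.tensor as tt

    g = tt.dvector('g')
    # created once with a zero initial value, just like m_prev above
    m_prev = theano.shared(np.zeros(3), name='m_prev')
    m_t = 0.9 * m_prev + 0.1 * g

    # the updates dict only *describes* how m_prev changes on each call
    step = theano.function([g], m_t, updates={m_prev: m_t})

    print(step(np.ones(3)))  # [0.1 0.1 0.1]
    print(step(np.ones(3)))  # [0.19 0.19 0.19] -- state persists across calls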

One additional question: the approach here seems to be to generate a new value for each (Theano) node via symbolic_sample_over_posterior. For an Approximation, will these be samples from the variational distribution? Is there a way to make those samples available to the update rule, rather than just the (approximate) gradients? If not, then it looks like VOGN (from https://arxiv.org/abs/1906.02506) is just ADAM with slightly different weights.
