I’m looking at the framework for variational inference, because I would like to evaluate the “Noisy ADAM” variant described here:

I roughly follow the framework (an operator, in this case KL, plus an approximation, in this case MeanField, plus a step function, in this case ADAM).

Looking at the implementation, I can’t quite see how the momentum actually works:

```
import numpy as np
import theano
import theano.tensor as tt

t = t_prev + 1
a_t = learning_rate / (one - beta1**t)
for param, g_t in zip(params, all_grads):
    value = param.get_value(borrow=True)
    # per-parameter optimizer state as shared variables, initialised to zero
    m_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                           broadcastable=param.broadcastable)
    u_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                           broadcastable=param.broadcastable)
    m_t = beta1 * m_prev + (one - beta1) * g_t   # first-moment estimate
    u_t = tt.maximum(beta2 * u_prev, abs(g_t))   # infinity-norm estimate
    step = a_t * m_t / (u_t + epsilon)
    updates[m_prev] = m_t
    updates[u_prev] = u_t
    updates[param] = param - step
updates[t_prev] = t
```

It seems to me that `m_prev` and `u_prev` are, on every iteration, always equal to a zero matrix, effectively dropping out the momentum component.

**[Edit: Okay, it made sense literally 1 second after posting. This code only builds up the symbolic update graph; the zero matrix is just the initial value of the shared state. The `updates` dictionary sets `m_prev` to `m_t` on every call, and Theano ensures the updates happen in the correct order.]**
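To convince myself, here is a plain-numpy sketch (my own, not from the library) of what the compiled function effectively does on each call: the state starts at zero, but from the second iteration on it carries the accumulated momentum.

```python
import numpy as np

def adamax_step(param, g, m_prev, u_prev, t,
                learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adamax-style update, mirroring the symbolic graph above."""
    a_t = learning_rate / (1.0 - beta1**t)
    m_t = beta1 * m_prev + (1.0 - beta1) * g   # momentum accumulates here
    u_t = np.maximum(beta2 * u_prev, np.abs(g))
    return param - a_t * m_t / (u_t + epsilon), m_t, u_t

param = np.array([1.0, -2.0])
m = np.zeros_like(param)   # initial value only, like theano.shared(zeros)
u = np.zeros_like(param)
for t in range(1, 4):
    g = 2.0 * param        # gradient of sum(param**2)
    param, m, u = adamax_step(param, g, m, u, t)
    print(t, m)            # m is non-zero after the very first call
```

The zero arrays play the same role as the `theano.shared(np.zeros(...))` initial values: they are only the state *before* the first update, not something that is reset on every iteration.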

Also, an additional question: the approach here seems to be to generate a new value for each (Theano) node via `symbolic_sample_over_posterior`. For an `Approximation`, will these be samples from the variational distribution? Is there a way to make them available to the `update` step, rather than only the (approximate) gradients? If not, then it looks like VOGN (from https://arxiv.org/abs/1906.02506) is just ADAM with slightly different weights.
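As an aside on "slightly different weights": the snippet above is actually the Adamax rule (a running infinity norm) rather than Adam's exponential moving average of squared gradients. A quick numpy sketch of just the second-moment scaling, using the standard Kingma & Ba formulas and ignoring Adam's bias correction of `v` for brevity:

```python
import numpy as np

def adam_scale(v_prev, g, beta2=0.999, epsilon=1e-8):
    """Adam: exponential moving average of squared gradients."""
    v_t = beta2 * v_prev + (1.0 - beta2) * g**2
    return v_t, np.sqrt(v_t) + epsilon

def adamax_scale(u_prev, g, beta2=0.999, epsilon=1e-8):
    """Adamax (as in the snippet above): running infinity norm."""
    u_t = np.maximum(beta2 * u_prev, np.abs(g))
    return u_t, u_t + epsilon

g = np.array([0.5, -3.0])
v, adam_denom = adam_scale(np.zeros(2), g)
u, adamax_denom = adamax_scale(np.zeros(2), g)
# Adamax jumps straight to |g| on the first step, while Adam's EMA
# of g**2 starts near zero and grows gradually.
```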