Effects of `scale_cost_to_minibatch`



I noticed that the tutorial on AEVB uses the scale_cost_to_minibatch option, in contrast to another tutorial involving an encoder (convolutional VAE). I looked it up in the documentation and in the code, but the information on this option is scarce. Setting it seems to affect the constants in front of the KL terms here, but which constants, and how? Does it just scale the whole loss function to the total size by the factor (total_size/minibatch_size), or does it scale only the KL terms, similar to what you would do in a \beta-VAE?


Good question, @ferrine?


Hi, scale_cost_to_minibatch is worth using when gradients explode. It affects only the scale of the gradient. When training on the full dataset, the loss looks like
ELBO = \log p(\mathcal{D}|\theta) - KL(q(\theta)||p(\theta)) \to \underset{\theta}{\max}
and yields the gradient
\nabla_\theta ELBO

When training with minibatches, the KL term is rescaled, since we need to correct the estimation bias of \hat{ELBO}:
\hat{ELBO} = \log p(\mathcal{D}_b|\theta) - \tfrac{b}{N}KL(q(\theta)||p(\theta)) \to \underset{\theta}{\max}
One property of this estimate:
\mathbb{E}\big[\tfrac{N}{b}\nabla_\theta \hat{ELBO}\big] = \mathbb{E}\big[\nabla_\theta ELBO\big]
You can see that the multiplication by \tfrac{N}{b} is needed to correct the scale (not the direction) of the gradient (this is done here). However, this is not always desirable. If minibatches are too small, the variance of the gradient is quite large because of \log p(\mathcal{D}_b|\theta), and the correction term \tfrac{N}{b} increases that variance further. In that case you need to manipulate your learning rate to achieve better convergence properties. Moreover, since the learning rate then depends on the data size, tricks with the learning rate become data dependent too. To make life easier, I decided to make this optional and set the default to scale_cost_to_minibatch=True (done here).
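As a quick numeric check of the unbiasedness property above (a toy sketch with made-up per-point gradient values, not PyMC3 code): averaging the rescaled minibatch gradient \tfrac{N}{b}\nabla_\theta\hat{ELBO} over all possible minibatches recovers the full-data gradient exactly.

```python
import itertools
import numpy as np

# Hypothetical per-datapoint gradients of log p(x_i | theta), plus a KL gradient.
N, b = 6, 2
g = np.array([0.5, -1.0, 2.0, 0.3, -0.7, 1.1])  # per-point gradient contributions
kl_grad = 0.4

full_grad = g.sum() - kl_grad  # gradient of the full-data ELBO

# Average the rescaled minibatch gradient (N/b) * grad(ELBO_hat),
# where ELBO_hat downweights the KL term by b/N, over every size-b minibatch.
batches = itertools.combinations(range(N), b)
est = np.mean([(N / b) * (g[list(idx)].sum() - (b / N) * kl_grad)
               for idx in batches])

assert np.isclose(est, full_grad)  # the scaled estimator is unbiased
```

Each datapoint appears in the same fraction b/N of minibatches, which is exactly why the N/b multiplier fixes the scale of the estimate without changing its direction on average.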


Thanks @ferrine, that was a very clear explanation.

So, to summarize, scale_cost_to_minibatch just switches between ELBO when False and \hat{ELBO} when True, by re-scaling the whole cost function by N/b or 1, respectively (correct me if I got it wrong).

A quick follow-up question on scaling: I tried implementing a \beta-VAE, where the KL term is scaled by some factor \beta, with \beta=1 being the usual ELBO. Would it be enough to just update the scaling with x.scaling = beta * x.scaling after initializing x with x = Normal('x', ...), or is there something else I should keep in mind? There is also the warm-up trick, which uses \beta=\beta(i), where \beta interpolates from \beta(0)=0 to \beta(\infty)=1 during training. Does it make sense to set x.scaling = shared(0) and then update x.scaling with a callback, or is there more to it?


Setting an attribute is the wrong way; that scaling becomes a part of the graph. I would rather provide a mutable total_size for the observed variable: it should be batch * beta, where beta is shared.


Do not set beta to zero there, you'll get NaNs.


Thanks for the hint, but it seems that total_size accepts only ints or lists of ints on pymc3 master, see the definition here.

Would it make sense to add a separate scale parameter to the prior definition, as in
pm.Normal("x", 0, 1, shape=(10, 1), total_size=100, scale=0.1)?
Using total_size in this context seems like a bit of a hack, although arguably tweaking the scale of the KL term could also be considered a hack.


Hmm, yeah, that's right, I missed that. I can try to add a new KL objective during my ✈️ home, that doesn't seem too hard, just a bit of refactoring.


Just to clarify: as far as I understand, \beta-VAE and the warm-up trick are about re-scaling KL terms, but only for some of the local variables. So I'm not sure adding another KL objective would do the job, as presumably it would scale the KL terms for all variables.


I thought KL warm-up is applied to all the variables. I'm about to submit a pull request; symbolic scaling can be the next PR.


The PR is almost done, you can try it out. I think that \beta is hacking the objective, and this thing should not be exposed to most users, so I've decided to leave it in the base KLqp inference class as an optional argument. You can create an Approximation and pass it there.

import theano
import pymc3 as pm

approx = pm.MeanField()  # inside a model context
beta = theano.shared(1.)
inference = pm.KLqp(approx, beta=beta)

Usage with simple example
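For the warm-up question earlier in the thread, a hedged sketch of a schedule for that shared beta (the function name and parameters here are my own illustration, not part of the PR): ramp beta linearly over the first iterations, keeping it strictly above zero to avoid the NaNs mentioned above.

```python
# Illustrative KL warm-up schedule for a shared beta variable.
def beta_schedule(i, n_warmup=1000, floor=1e-3):
    """Ramp beta linearly from `floor` to 1 over `n_warmup` iterations."""
    return max(floor, min(1.0, i / n_warmup))

# During fitting one would call beta.set_value(beta_schedule(i))
# from a callback; the floor keeps beta away from exactly zero.
assert beta_schedule(0) == 1e-3
assert beta_schedule(500) == 0.5
assert beta_schedule(5000) == 1.0
```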


Wow, that was quick! I’ll give it a go next week.