Effects of `scale_cost_to_minibatch`

pwl · June 26, 2018, 4:57pm

I noticed that the tutorial on AEVB uses the scale_cost_to_minibatch option, in contrast to another tutorial involving an encoder (convolutional VAE). I looked it up in the documentation and in the code, but information on this option is scarce. Setting this option seems to affect the constants in front of the KL terms here, but which constants, and how? Does it just scale the whole loss function to the total size by the factor (total_size/minibatch_size), or does it scale only the KL terms, similar to what you would do in a \beta-VAE?

junpenglao · June 27, 2018, 7:24am

Good question, @ferrine?

ferrine · June 27, 2018, 8:52am

Hi, scale_cost_to_minibatch is worth using when gradients explode. It affects only the gradient term. When training on the full dataset, the objective looks like
ELBO = \log p(\mathcal{D}|\theta) - KL(q(\theta)||p(\theta)) \to \underset{\theta}{\max}
and yields the gradient
\nabla_\theta ELBO

When training with minibatches, the KL term is rescaled, since we need to correct the estimation bias of \hat{ELBO}:
\hat{ELBO} = \log p(\mathcal{D}_b|\theta) - \tfrac{b}{N}KL(q(\theta)||p(\theta)) \to \underset{\theta}{\max}
One property of this estimate:
\mathbb{E}\left[\tfrac{N}{b}\nabla_\theta \hat{ELBO}\right] = \mathbb{E}\left[\nabla_\theta ELBO\right]
You can see that the multiplication by \tfrac{N}{b} is needed to correct the bias in scale (not direction) (this is done here). However, this is not usually desirable. If minibatches are too small, the variance of the gradient is quite large because of \log p(\mathcal{D}_b|\theta), and the correction factor \tfrac{N}{b} further increases that variance. In this case you need to adjust your learning rate to achieve better convergence. Moreover, because the suitable learning rate then depends on the data size, learning-rate tricks become data dependent too. To make life easier, I decided to make this optional and set the default to scale_cost_to_minibatch=True (done here).
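The scaling argument above can be checked numerically. The following is a minimal NumPy sketch, not PyMC3 code. It assumes a toy model (x_i ~ N(\theta, 1), prior \theta ~ N(0, 1), Gaussian q with mean mu), for which the gradient of the ELBO with respect to mu is available in closed form. The names full_grad and minibatch_grad are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, b = 1000, 10                      # dataset size and minibatch size
x = rng.normal(2.0, 1.0, size=N)     # toy dataset
mu = 0.5                             # current mean of q(theta)

def full_grad(mu):
    # d/dmu of [E_q log p(D|theta) - KL(q||p)] for the toy model
    return np.sum(x - mu) - mu

def minibatch_grad(mu, batch):
    # d/dmu of \hat{ELBO}: minibatch likelihood term, KL scaled by b/N
    return np.sum(batch - mu) - (b / N) * mu

# Gradient estimates over many random minibatches
grads = np.array([
    minibatch_grad(mu, x[rng.choice(N, size=b, replace=False)])
    for _ in range(5000)
])
rescaled = (N / b) * grads           # apply the N/b correction

print("full gradient:          ", full_grad(mu))
print("mean rescaled minibatch:", rescaled.mean())
print("std without / with N/b: ", grads.std(), rescaled.std())
```

In this sketch the rescaled estimate matches the full gradient on average, while its standard deviation is N/b times that of the unscaled minibatch gradient, which is the variance issue described above.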

pwl · June 27, 2018, 3:50pm

Thanks @ferrine, that was a very clear explanation.

So, to summarize, scale_cost_to_minibatch just switches between ELBO when False and \hat{ELBO} when True, by re-scaling the whole cost function by either N/b or 1, respectively (correct me if I got it wrong).

A quick follow-up question on scaling: I tried implementing a \beta-VAE, where the KL term is scaled by some factor \beta, with \beta=1 being the usual ELBO. Would it be enough to just update the scaling with x.scaling=beta*x.scaling after initializing x with x=Normal('x',...), or is there something else I should keep in mind? There is also the warm-up trick, which uses \beta=\beta(i), where \beta interpolates from \beta(0)=0 to \beta(\infty)=1 during training. Does it make sense to set x.scaling=shared(0) and then update x.scaling with a callback, or is there more to it?
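Written in the notation used earlier in the thread, the objective in question would be the following (the subscript \beta is just a label chosen here for illustration):

```latex
ELBO_\beta = \log p(\mathcal{D}|\theta) - \beta \, KL(q(\theta)||p(\theta)) \to \underset{\theta}{\max},
\qquad \beta = 1 \;\Rightarrow\; ELBO_\beta = ELBO
```

With warm-up, \beta is replaced by a schedule \beta(i) that increases from \beta(0)=0 towards 1 as the iteration count i grows.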

ferrine · June 28, 2018, 3:25am

Setting an attribute is the wrong way, because this scaling becomes a part of the graph. I would rather provide a mutable total_size for the observed variable; it should be batch*beta, where beta is a shared variable.

ferrine · June 28, 2018, 3:29am

Do not set beta to zero there, or you’ll get NaNs.

pwl · July 2, 2018, 2:37pm

Thanks for the hint, but it seems that total_size accepts only ints or lists of ints on pymc3 master; see the definition here.

Would it make sense to add a separate scale parameter to the prior definition as in
pm.Normal("x",0,1,shape=(10,1),total_size=100,scale=0.1)? Using total_size in this context seems like a bit of a hack. Although arguably tweaking the scale in the KL term could also be considered a hack.

ferrine · July 2, 2018, 2:57pm

Hmm, yeah, that’s right, I’ve missed that. I can try to add a new KL objective on my way home; that doesn’t seem that hard, just a bit of refactoring.

pwl · July 2, 2018, 3:20pm

Just to clarify, as far as I understand, \beta-VAE and the warm-up trick are about re-scaling KL terms, but only for some of the local variables. So I’m not sure if adding another KL objective would do the job, as presumably this would scale the KL terms for all variables.

ferrine · July 2, 2018, 7:44pm

I thought KL warm-up is applied to all the variables. I’m about to submit a pull request; symbolic scaling can be the next PR.

ferrine · July 4, 2018, 10:39pm

PR is almost done, you can try it out. I think that \beta is hacking the objective, and this should not be exposed to most users. Thus I’ve decided to leave it in the base KLqp Inference class as an optional argument. You can create an Approximation and pass it there.

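import pymc3 as pm
import theano
# Run inside a `with model:` block so the approximation can find the model.
approx = pm.MeanField()
beta = theano.shared(1.)  # shared variable, so beta can be changed during training
inference = pm.KLqp(approx, beta=beta)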

Usage with simple example
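For the warm-up schedule asked about earlier in the thread, a shared beta could be updated from a fit callback. The sketch below is a hedged illustration, not part of the PR: linear_beta and make_warmup_callback are hypothetical helpers, the (approx, losses, i) callback signature is assumed to match what PyMC3's fit passes to callbacks, and the small positive floor beta_min is a cautious choice prompted by the NaN warning above (the thread does not say whether beta=0 is safe for KLqp).

```python
def linear_beta(i, warmup_steps, beta_min=1e-3):
    """Linearly interpolate beta from beta_min (at i=0) to 1 (at i>=warmup_steps)."""
    frac = min(max(i / warmup_steps, 0.0), 1.0)
    return (1.0 - frac) * beta_min + frac

def make_warmup_callback(beta_shared, warmup_steps, beta_min=1e-3):
    """Build a callback that writes the scheduled beta into a shared variable.

    beta_shared only needs a set_value method (e.g. a theano shared variable).
    """
    def callback(approx, losses, i):
        # Assumed signature: approximation, loss history, iteration number
        beta_shared.set_value(linear_beta(i, warmup_steps, beta_min))
    return callback

# Hypothetical usage with the snippet above (requires pymc3 and theano):
# beta = theano.shared(1.)
# inference = pm.KLqp(approx, beta=beta)
# inference.fit(10000, callbacks=[make_warmup_callback(beta, warmup_steps=2000)])
```

Keeping the schedule as a plain function makes it easy to swap in a different interpolation without touching the inference code.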

pwl · July 5, 2018, 10:09am

Wow, that was quick! I’ll give it a go next week.
