I noticed that the tutorial on AEVB uses the `scale_cost_to_minibatch` option, in contrast to another tutorial involving an encoder (convolutional VAE). I looked it up in the documentation and in the code, but the information on this option is scarce. Setting it seems to affect the constants in front of the KL terms here, but which constants, and how? Does it scale the whole loss function up to the total data size by the factor `(total_size/minibatch_size)`, or does it scale only the KL terms, similar to what you would do in a \beta-VAE?

Hi, `scale_cost_to_minibatch` is worth using when gradients explode; it affects only the scale of the gradient. When training on the full dataset, the loss looks like

ELBO = \log p(\mathcal{D}|\theta) - KL(q(\theta)||p(\theta)) \to \underset{\theta}{\max}

and yields gradient

\nabla_\theta ELBO

When training with minibatches, the KL term is rescaled, since we need to correct the estimation bias of \hat{ELBO}:

\hat{ELBO} = \log p(\mathcal{D}_b|\theta) - \tfrac{b}{N}KL(q(\theta)||p(\theta)) \to \underset{\theta}{\max}

One property of this estimate:

\mathbb{E}\left[\tfrac{N}{b}\nabla_\theta \hat{ELBO}\right] = \mathbb{E}\left[\nabla_\theta ELBO\right]

You can see that the multiplication by \tfrac{N}{b} is needed to correct the scale (not the direction) of the gradient (this is done here). However, this is not always desirable. If minibatches are too small, the variance of the gradient is already large because of \log p(\mathcal{D}_b|\theta), and the correction factor \tfrac{N}{b} increases that variance further. In this case you need to manipulate your learning rate to achieve better convergence, and since the factor depends on the data size, those learning-rate tricks become data-dependent too. To make life easier, I decided to make this optional and set the default to `scale_cost_to_minibatch=True`

(done here).
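The unbiasedness property above is easy to check numerically. Below is a toy NumPy sketch (a hypothetical example with a unit-variance Gaussian likelihood; the KL term is omitted since it does not depend on the minibatch). Averaging the \tfrac{N}{b}-scaled minibatch gradients over disjoint minibatches recovers the full-data gradient exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, b = 12, 3
data = rng.normal(loc=2.0, scale=1.0, size=N)
theta = 0.5  # current parameter value

def grad_loglik(x, theta):
    # gradient of sum_i log N(x_i | theta, 1) with respect to theta
    return np.sum(x - theta)

full_grad = grad_loglik(data, theta)

# split the data into N/b disjoint minibatches; each minibatch gradient
# is rescaled by N/b, and their average recovers the full-data gradient
batches = data.reshape(N // b, b)
scaled = [(N / b) * grad_loglik(batch, theta) for batch in batches]
estimate = np.mean(scaled)

print(full_grad, estimate)  # identical up to floating point
```

With random (rather than disjoint) minibatches the equality holds only in expectation, which is exactly the bias-correction argument above.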

Thanks @ferrine, that was a very clear explanation.

So, to summarize, `scale_cost_to_minibatch` just switches between ELBO when `False` and \hat{ELBO} when `True`, by re-scaling the whole cost function by either N/b or 1, respectively (correct me if I got it wrong).

A quick follow-up question on scaling: I tried implementing a \beta-VAE, where the KL term is scaled by some factor \beta, with \beta=1 being the usual ELBO. Would it be enough to update the scaling with `x.scaling = beta * x.scaling` after initializing `x` with `x = Normal('x', ...)`, or is there something else I should keep in mind? There is also the warm-up trick, which uses \beta = \beta(i), where \beta interpolates from \beta(0)=0 to \beta(\infty)=1 during training. Does it make sense to set `x.scaling = shared(0)` and then update `x.scaling` with a callback, or is there more to it?
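For concreteness, the warm-up schedule I have in mind is just a monotone function of the iteration index. A minimal sketch (the function name and `n_warmup` parameter are illustrative, not part of any PyMC3 API):

```python
def kl_warmup(i, n_warmup=1000):
    """Linear KL warm-up: beta(0) = 0, ramping to beta = 1 after n_warmup steps."""
    return min(1.0, i / n_warmup)

# beta interpolates from 0 to 1 over the first n_warmup iterations
print(kl_warmup(0), kl_warmup(500), kl_warmup(2000))  # 0.0 0.5 1.0
```

The callback would evaluate this at each iteration and push the value into the shared variable.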

Setting an attribute is the wrong way; that scaling becomes a fixed part of the graph. I would rather provide a mutable `total_size`; for observed variables it should be `batch*beta`, where `beta` is a shared variable. Do not set beta to zero there, or you'll get NaNs.

Thanks for the hint, but it seems that `total_size` accepts only ints or lists of ints on pymc3 master, see the definition here. Would it make sense to add a separate scale parameter to the prior definition, as in `pm.Normal("x", 0, 1, shape=(10, 1), total_size=100, scale=0.1)`? Using `total_size` in this context seems like a bit of a hack, although arguably tweaking the scale in the KL term could also be considered a hack.

Hmm, yeah, that's right, I'd missed that. I can try to add a new KL objective on the way home; that doesn't seem that hard, just a bit of refactoring.

Just to clarify: as far as I understand, \beta-VAE and the warm-up trick are about re-scaling the KL terms, but only for *some* of the local variables. So I'm not sure adding another KL objective would do the job, as presumably it would scale the KL terms for *all* variables.

I thought KL warm-up is applied to all the variables. I'm about to submit a pull request; symbolic scaling can be the next PR.

The PR is almost done, you can try it out. I think that \beta is hacking the objective and this thing should not be exposed to most users, so I've decided to leave it in the base KLqp Inference class as an optional argument. You can create an Approximation and pass it there:

```python
import pymc3 as pm
import theano

approx = pm.MeanField()   # variational approximation (built inside a model context)
beta = theano.shared(1.)  # KL weight; update it during training for warm-up
inference = pm.KLqp(approx, beta=beta)
```
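Presumably (assuming the standard \beta-VAE form, not confirmed from the PR diff), the objective this `beta` argument weights is

\hat{ELBO}_\beta = \log p(\mathcal{D}_b|\theta) - \beta\,\tfrac{b}{N}\,KL(q(\theta)||p(\theta)) \to \underset{\theta}{\max}

with \beta=1 recovering the objective discussed above.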

Wow, that was quick! I'll give it a go next week.