How to choose initial values for MeanField sigma, and transform rho->sigma?

Hi everyone

We have implemented the same model in pymc3, pyro and numpyro ( but we are getting substantially different results when performing Variational Inference with Mean Field approximation. Specifically, inference with pymc3 appears to be substantially more stable whereas with pyro and numpyro ELBO starts to increase massively after about the first 10k iterations. So I looked into the differences between implementations, and found 2 that could be relevant:

  1. Pymc3 transforms rho->sigma parameter of Mean Field approximation with softplus whereas pyro uses exponentiation
  2. Pyro sets initial value of sigma to 0.1 whereas I was not able to find the values chosen in pymc3

Now I am trying to understand which of these ways to initialise and transform sigma are more motivated and whether these are important for training stability. Any help would be appreciated, including finding how pymc3 initialises sigma.

Hi @vitkl
This function initialises rho to all zeros over the flattened shape of parameters in Mean Field Approximation. I do sometimes find starting values unstable for a few hundred iterations. So, one good option can be to use all ones as its there in PyMC-TFP. I am not sure how one can pass starting values. So, let’s ping @ferrine for this.

One option I try many times is to increase samples size obj_n_mc that better approximates the gradient which leads to good convergence. Regarding the use of softplus, I find this discussion on TFP side super useful.


Hi @Sayam753

That’s great. Thanks for clarifying! The problem I am facing is actually poor stability at the end of training (see picture below) which happens very early in pyro and numpyro but also happens in pymc3 with non-negative factorisation models that I am working on.

When I replaced exp with softplus in the numpyro translation of our model that improved the stability by a lot but it is still not as stable as pymc3.

rho=0 is sigma=0.69 which is fairly similar to both 1 in PyMC-TFP and in Pyro. This makes me wonder whether there are more informative ways of initialising sigma - for example sigma^2 = abs(mu). Do you know of any relevant discussions or literature?

On the issue of softplus, is there a motivation to use exp rather than softplus when converting posterior samples from unconstrained to positive?

I am trying to follow @Sayam753 suggestion to change obj_n_mc in ADVI().fit(), however, I can’t seem to find where the default values are specified, is it done here?

For MeanField ADVI, here is the exact mention where rho is initialized.

Here is one way of initialising parameters, using inf_kwargs –

with model:
    flat_shape = pm.ADVI().approx.bij.ordering.size
    trace2 =
        'params': {'rho': theano.shared(np.ones(flat_shape)), 'mu': theano.shared(np.ones(flat_shape))}

Do you mean here, to generate sigma values depending on mu at each step of iteration or just treating as the initial values?

pm.variational.opvi.Group could be a better way to initialize parameters. But I am not much familiar with its API.
From literature point of view, initialising rho to ones, is a better assumption because the correlation of parameters with themselves will be one. I think some initial values can be troublesome leading to NaNs or high orders of magnitudes of ELBO, at first few hundred iterations.

Exp grows too fast. Softplus better positively constrains the values because of using both log and exp. I see this as a good motivation.

1 Like

Thanks again! That’s very informative.

Here is one way of initialising parameters, using inf_kwargs

I see how pm.variational.opvi.Group would be a better way to initialise because currently, you don’t know where in the flattened parameters each variable is represented. I found elsewhere on the discourse that means can be initialised by providing test_value when defining distribution - I was using that so far.

Do you mean here, to generate sigma values depending on mu at each step of iteration or just treating as the initial values?

Yes, I was suggesting to initialise sigma at more informative values, not the dependence during training. Not sure I understand why rho=1 should give a correlation of 1, could you please clarify which variables are being correlated? Also, \sigma^2 = softmax(1) ^ 2 = 1.7 - but I thought that you want to aim for \sigma^2=1.

Exp grows too fast.

That’s exactly my thinking about softmax vs exp, however, pymc3 uses exp to convert unconstrained samples from Mean Field approximation to positive scale (for example, when Gamma distribution is approximated by Mean Field). So I am wondering what is the reason behind that decision.

For MeanField ADVI, here is the exact mention where rho is initialized.

Thanks! I mean I am struggling to find the default value for obj_n_mc not rho.