Variational inference: diagnosing convergence

I am looking into ways to find a “sensible” value for the number of iterations required to approximate a posterior using variational inference (ADVI). A closely related topic is “Justification for ADVI convergence criterion?”.

I used the Introduction to Variational Inference with PyMC notebook from the PyMC example gallery to guide me while learning about and trying out this approach.

I am (deliberately) using a simple linear regression model to understand the convergence behaviour of the ADVI optimisation.

I consider two model variants, one using a full-batch and the other a mini-batch approach, and compare their “performance”, i.e. wall time until convergence.
This is done for two variants of the problem: one with “little” data (n=1000 samples) and one with lots of data (n=1_000_000 samples).


import time

from pymc.variational.callbacks import Tracker, CheckParametersConvergence

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

def generate_data(num_samples: int) -> pd.DataFrame:
    rng = np.random.default_rng(seed=42)
    beta = 1.0
    sigma = 10.0
    x = rng.normal(loc=0.0, scale=1.0, size=num_samples)
    y = beta * x + sigma * rng.normal(size=num_samples)
    return pd.DataFrame({"x": x, "y": y})


def make_model(frame: pd.DataFrame) -> pm.Model:
    with pm.Model() as model:
        # Data
        x = pm.Data("x", frame["x"])
        y = pm.Data("y", frame["y"])

        # Prior
        beta = pm.Normal("beta", sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=20.0)

        # Linear model
        mu = beta * x

        # Likelihood
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    return model


def make_model_minibatch(frame: pd.DataFrame) -> pm.Model:
    with pm.Model() as model:
        # Data
        x, y = pm.Minibatch(frame["x"], frame["y"], batch_size=10)

        # Prior
        beta = pm.Normal("beta", sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=20.0)

        # Linear model
        mu = beta * x

        # Likelihood
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y, total_size=len(frame))
    return model

if __name__ == "__main__":
    frame = generate_data(num_samples=1_000_000)
    model = make_model(frame)
    with model:
        advi = pm.ADVI()
        tracker = Tracker(
            mean=advi.approx.mean.eval,
            std=advi.approx.std.eval
        )
        t0 = time.time()
        approx = pm.fit(
            n=1_000_000,
            method=advi,
            obj_optimizer=pm.adam(),
            callbacks=[
                CheckParametersConvergence(diff="relative", tolerance=1e-3),
                tracker
            ]
        )
        t = time.time() - t0
        print(f"Time for fit is {t:.3f}s.")
    fig = plt.figure()
    mu_ax = fig.add_subplot(221)
    std_ax = fig.add_subplot(222)
    hist_ax = fig.add_subplot(212)
    mu_ax.plot(tracker["mean"])
    mu_ax.set_title("Mean track")
    std_ax.plot(tracker["std"])
    std_ax.set_title("Std track")
    hist_ax.plot(advi.hist)
    hist_ax.set_title("Negative ELBO track")

    model = make_model_minibatch(frame)
    with model:
        advi = pm.ADVI()
        tracker = Tracker(
            mean=advi.approx.mean.eval,
            std=advi.approx.std.eval
        )
        t0 = time.time()
        approx = pm.fit(
            n=1_000_000,
            method=advi,
            obj_optimizer=pm.adam(),
            callbacks=[
                CheckParametersConvergence(diff="relative", tolerance=1e-3),
                tracker
            ]
        )
        t = time.time() - t0
        print(f"Time for fit is {t:.3f}s.")
    fig = plt.figure()
    mu_ax = fig.add_subplot(221)
    std_ax = fig.add_subplot(222)
    hist_ax = fig.add_subplot(212)
    mu_ax.plot(tracker["mean"])
    mu_ax.set_title("Mean track")
    std_ax.plot(tracker["std"])
    std_ax.set_title("Std track")
    hist_ax.plot(advi.hist)
    hist_ax.set_title("Negative ELBO track")

    plt.show()

For the small-sample variant, comparing wall time until convergence, the full-batch approach (top) beats the mini-batch variant (bottom).

For the full-batch variant the parameter convergence plots are:
[figure: fullbatch_advi_parameters_convergence]

and for the mini-batch variant:
[figure: minibatch_advi_parameters_convergence]

For the mini-batch variant the parameters seem to keep “fluctuating” around a constant value, while they look essentially “constant” in the full-batch case. Intuitively this is what I would have expected: without a learning-rate decay there will remain some “limiting variance” in the parameters (similar to the typical SGD behaviour).

It is, however, unclear to me how to specify a meaningful value for the tolerance in the CheckParametersConvergence callback when such a “limiting variance” exists. For other (more complex) models I have observed more pronounced fluctuations even though the ELBO values already “seemed to have converged”.
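
As a crude way to pick a data-driven tolerance I have been looking at the tail of the tracked means after a pilot run, roughly as below. The choice of n_tail and the fact that I only look at the means (while CheckParametersConvergence, as far as I understand, looks at all flattened variational parameters) are my own simplifications:

    import numpy as np

    # `tracker["mean"]` holds one vector of variational means per iteration
    # (in the unconstrained space, e.g. log-sigma rather than sigma).
    means = np.asarray(tracker["mean"])

    n_tail = 1_000                       # arbitrary: last 1000 iterations of the pilot run
    tail = means[-n_tail:]

    # Relative fluctuation of each parameter over the tail of the optimisation.
    rel_fluct = tail.std(axis=0) / (np.abs(tail.mean(axis=0)) + 1e-12)
    print("relative fluctuation per parameter:", rel_fluct)

    # A tolerance for CheckParametersConvergence(diff="relative") that is much smaller
    # than this fluctuation seems unlikely to ever trigger, so this gives me a lower bound.
    print("lower bound for a 'relative' tolerance:", rel_fluct.max())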

If I move to the large-data variant, the mini-batch approach (top) indeed becomes vastly more efficient.

Parameter convergence for full-batch:
[figure: fullbatch_advi_parameters_convergence]

Parameter convergence for mini-batch:
[figure: minibatch_advi_parameters_convergence]

I can easily envision a situation in which I cannot specify a meaningful tolerance a priori (because the “limiting variance” of the parameter fluctuations is unknown a priori). In such a situation I will likely not be able to “stop the iteration early”, and the full-batch and mini-batch approaches will both be equally inefficient.

So it seems to me that a meaningful and robust convergence check for the optimisation is necessary in order to benefit from the potential of the mini-batch approach, especially for more complex models. But at the same time, for more complex models there is no reasonable a priori choice for the stopping tolerance when looking at parameter convergence.

Should I be stopping based on ELBO instead?
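
For concreteness, this is roughly what I had in mind as an ELBO-based stopping rule: a callback that compares the mean negative ELBO over the last window with the window before, and raises StopIteration once the relative change is small. I am assuming that callbacks are called as callback(approx, loss_history, i), like CheckParametersConvergence appears to be; the class name and all thresholds are my own invention.

    import numpy as np

    class CheckLossConvergence:
        """Stop once the windowed mean of the negative ELBO stops changing (rough sketch)."""

        def __init__(self, every=100, window=1_000, tolerance=1e-3):
            self.every = every          # how often to perform the check
            self.window = window        # iterations per averaging window
            self.tolerance = tolerance  # relative change regarded as "converged"

        def __call__(self, approx, hist, i):
            if i % self.every or len(hist) < 2 * self.window:
                return
            hist = np.asarray(hist)
            recent = hist[-self.window:].mean()
            previous = hist[-2 * self.window:-self.window].mean()
            rel_change = np.abs(recent - previous) / (np.abs(previous) + 1e-12)
            if rel_change < self.tolerance:
                raise StopIteration(f"Loss converged, relative change {rel_change:.2e}")

    # usage: pm.fit(n=1_000_000, method=advi, callbacks=[CheckLossConvergence(), tracker])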

In applications I will potentially fit the same model to varying (but similar, e.g. growing as more groups become available) datasets, and I cannot afford to “fine-tune” the tolerance anew for every fit in order to obtain a meaningful stopping criterion each time.

I’d be glad to get any insights into how this is typically handled in application/production settings.


Thanks for sharing this. I have been playing with minibatch for a very large national dataset and I was wondering how to deal with convergence.

Slightly tangential, but learning rate schedulers for ADVI have a PR that is just waiting for someone to come along and get it back on track :)

Some more general comments:

  • Convergence detection when doing SGD is a pretty complex subject in the larger ML world, no? I think the typical approach in big-data regimes is to use a holdout set, and terminate training when the test loss starts to U-turn.
  • The PyMC “default” convergence check that you get when you do init="advi+adapt_diag" is given here as:
    cb = [
        pm.callbacks.CheckParametersConvergence(tolerance=1e-2, diff="absolute"),
        pm.callbacks.CheckParametersConvergence(tolerance=1e-2, diff="relative"),
    ]

Rightly or wrongly, I’ve viewed this as a reasonable “prior” on how to check for convergence.
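
Passing that to a standalone fit would just be something like this (with model standing in for whatever model you are fitting, and the iteration count purely illustrative):

    with model:
        approx = pm.fit(n=100_000, method="advi", callbacks=cb)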

@jessegrabowski thanks a lot for the quick reply and the link to the learning rate scheduler PR.

My thoughts on your two comments are below:

  • Yes, you’re absolutely right. This is something that has always bothered me a bit; I was hoping that for ADVI there would be a silver bullet that I just wasn’t aware of. Obviously not. Regarding the holdout dataset: how would one evaluate on a holdout dataset during an ADVI iteration? Is that even technically possible with the current API? I have sketched below (after these bullets) roughly what I had in mind. Also, this strikes me as a bit of a mismatch with the “Bayesian approach”, where I use all the data to arrive at the posterior.

  • Thanks for mentioning the default convergence check that is implied by the “advi+adapt_diag” option.
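
To make the holdout question more concrete, the only way I can currently picture this is to fit in chunks and score a held-out set in between, roughly as below. The chunked advi.fit calls (which, as far as I can tell, continue the optimisation from the current state), the use of approx.sample returning an InferenceData, and the hand-rolled log density (specific to this toy regression) are all my own assumptions, so I am not sure this is idiomatic:

    import numpy as np
    from scipy import stats

    data = generate_data(num_samples=1_010_000)
    train, holdout = data.iloc[:1_000_000], data.iloc[1_000_000:]

    model = make_model_minibatch(train)
    with model:
        advi = pm.ADVI()
        prev_lpd = -np.inf
        for chunk in range(50):
            # Continue the optimisation for another chunk of iterations.
            approx = advi.fit(n=2_000, progressbar=False)

            # Draw from the current approximation and score the held-out data.
            idata = approx.sample(draws=200)
            beta = idata.posterior["beta"].values.ravel()
            sigma = idata.posterior["sigma"].values.ravel()
            # Average pointwise log density over draws (good enough for monitoring,
            # even though it is not the exact log predictive density).
            lpd = np.mean(
                stats.norm.logpdf(
                    holdout["y"].values[:, None],
                    loc=beta[None, :] * holdout["x"].values[:, None],
                    scale=sigma[None, :],
                )
            )
            print(f"chunk {chunk}: held-out log density per point = {lpd:.4f}")

            # Very naive "U-turn" check; in practice this would need smoothing.
            if lpd < prev_lpd:
                break
            prev_lpd = lpd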

Setting aside the convergence issue: assuming I want to re-use the result from a previous fit when approximating the posterior (for the same model) on an enlarged dataset, is there an example that shows how to do this with PyMC, i.e. use the parameters from a previous fit as the initial point for the next fit?

I found something here: How to save fitted ADVI Result? - #3 by junpenglao. Is this still the recommended way to do it?

Assume I am looking at a hierarchical model whose posterior I have approximated with ADVI, and I now receive data on additional groups while the overall model structure stays the same. I’d like to re-use the approximation from my previous fit to inform the fit of the new model with the additional groups; is that possible? Specifically, if I have group-specific slopes with partial pooling, I could re-use the parameters for the existing groups, and I could probably use the population slope to come up with a good initial guess for the parameters of the new groups, so that the next fit is a lot faster?
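
For the simpler case where the model structure (and therefore the shape of the variational parameters) stays the same and only the data grows, this is roughly what I would try: copy the shared mean-field parameters from the old approximation into a freshly created ADVI object. The names approx_old / frame_new are placeholders, and the assumption that approx.params is ordered as (mu, rho) for the mean-field family is unverified on my side:

    # Variational parameters from the previous fit (approx_old = result of the earlier pm.fit).
    mu_old = approx_old.params[0].get_value()   # variational means (assumed ordering)
    rho_old = approx_old.params[1].get_value()  # variational rhos (softplus-parameterised stds)

    # Fresh fit on the enlarged data set, same model structure.
    model_new = make_model_minibatch(frame_new)
    with model_new:
        advi = pm.ADVI()
        # Warm start: overwrite the freshly initialised shared parameters.
        advi.approx.params[0].set_value(mu_old)
        advi.approx.params[1].set_value(rho_old)
        approx_new = advi.fit(n=50_000)

For the hierarchical case with new groups the shapes no longer match, so presumably only the slices corresponding to existing parameters could be copied over. I also noticed that pm.fit accepts a start argument (and, if I read things correctly, newer versions also a start_sigma argument), which might be the cleaner route, but I have not tried it.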

The main problems with standard ADVI as defined in Kucukelbir et al.'s paper are:

  1. High variance of the stochastic gradients: upping the number of Monte Carlo samples used for the gradient estimate to 5K or 50K helps immensely with optimization, but you’ll need a GPU to make that tractable (see the sketch after this list for the corresponding PyMC setting).

  2. Bad step size adaptation—you can try a grid of step sizes in parallel, but you have to be careful to get step sizes that work in the initial iterations and the iterations nearer convergence.

  3. Bad reparameterization gradient for termination—if you’re not using a huge number of stochastic gradient iterations for the ELBO (approximate negative KL), using the stick-the-landing reparameterization gradient is more effective than the simple approach in the paper.

  4. As a termination criterion, using the ELBO can be tricky because it tends to flatten out quite a bit before actual convergence (you can see this in the OP’s plots). You can monitor the norm of the parameters themselves across iterations (also illustrated in the sketch after this list) or use something like the Wasserstein distance, but that’s super expensive.
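
For points 1 and 4, a rough PyMC sketch is below, reusing make_model, generate_data, and Tracker from the original script. As far as I can tell, the obj_n_mc keyword controls the number of Monte Carlo samples per gradient estimate (default 1), and the parameter-norm monitor is just a small computation on top of the tracked means; treat both as illustrative rather than a recommended recipe.

    import numpy as np

    model = make_model(generate_data(num_samples=1_000))
    with model:
        advi = pm.ADVI()
        tracker = Tracker(mean=advi.approx.mean.eval)
        approx = pm.fit(
            n=50_000,
            method=advi,
            obj_n_mc=100,  # Monte Carlo samples per gradient estimate (default is 1)
            callbacks=[tracker],
        )

    # Point 4: how much do the variational means move over a fixed lag of iterations?
    means = np.asarray(tracker["mean"])
    lag = 100
    delta_norm = np.linalg.norm(means[lag:] - means[:-lag], axis=1)
    # Plot delta_norm (or threshold it) as an additional convergence diagnostic.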

There are a couple of good references on this:

As a side note, you can often get just as good an approximation as ADVI if not better using 100 iterations of NUTS.
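
In PyMC terms that would be something along the lines of the following, reusing make_model from the original script (the draw and tune counts are purely illustrative):

    model = make_model(generate_data(num_samples=1_000))
    with model:
        idata = pm.sample(draws=100, tune=500, chains=4)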

“I think the typical approach in big-data regimes is to use a holdout set, and terminate training when the test loss starts to U-turn.”

You can’t quite do that in ML because of double descent. There’s also no natural holdout set when just given a black box target density with gradients. If you have a structured model where it makes sense to do posterior predictive inference for held out data, you could use that to mirror what the ML folks do. But then you’re not necessarily fitting the model you want to fit, but a version that’s regularized with early stopping.


@bob-carpenter thanks a lot for the neat summary. I will need to read the two papers, thanks for the suggestion.

Thanks for the explanation in 4., that’s quite helpful to know. Thanks also for the reference w.r.t. double descent, although this model is not “overparametrized” as far as I understand the concept.