I am looking into ways to find a “sensible” value for the number of iterations required to approximate a posterior using variational inference (ADVI). A closely related topic is “Justification for ADVI convergence criterion?”.

I used “Introduction to Variational Inference with PyMC” from the PyMC example gallery as a guide while learning about and trying out this approach.

I am (deliberately) using a simple linear regression model to understand the convergence behaviour of the ADVI optimisation.

I consider two model variants, one using a *full batch* and the other a *mini batch* approach.

I am comparing the “performance”, i.e. wall-time until convergence, for both approaches.

This is done for two variants of the problem: one with “little” data (n=1000 samples) and one with lots of data (n=1_000_000 samples).

```
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
from pymc.variational.callbacks import CheckParametersConvergence, Tracker


def generate_data(num_samples: int) -> pd.DataFrame:
    """Simulate y = beta * x + noise with beta = 1 and noise sd = 10."""
    rng = np.random.default_rng(seed=42)
    beta = 1.0
    sigma = 10.0
    x = rng.normal(loc=0.0, scale=1.0, size=num_samples)
    y = beta * x + sigma * rng.normal(size=num_samples)
    return pd.DataFrame({"x": x, "y": y})


def make_model(frame: pd.DataFrame) -> pm.Model:
    """Full-batch model: every ADVI iteration sees all rows."""
    with pm.Model() as model:
        # Data
        x = pm.Data("x", frame["x"])
        y = pm.Data("y", frame["y"])
        # Prior
        beta = pm.Normal("beta", sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=20.0)
        # Linear model
        mu = beta * x
        # Likelihood
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    return model


def make_model_minibatch(frame: pd.DataFrame) -> pm.Model:
    """Mini-batch model: every ADVI iteration sees a random batch of 10 rows."""
    with pm.Model() as model:
        # Data
        x, y = pm.Minibatch(frame["x"], frame["y"], batch_size=10)
        # Prior
        beta = pm.Normal("beta", sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=20.0)
        # Linear model
        mu = beta * x
        # Likelihood; total_size rescales the batch log-likelihood to the full data
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y, total_size=len(frame))
    return model


def fit_and_plot(model: pm.Model, label: str) -> None:
    """Fit with ADVI, print the wall time, and plot parameter and ELBO tracks."""
    with model:
        advi = pm.ADVI()
        # Record the variational mean and std at every iteration
        tracker = Tracker(mean=advi.approx.mean.eval, std=advi.approx.std.eval)
        t0 = time.time()
        pm.fit(
            n=1_000_000,
            method=advi,
            obj_optimizer=pm.adam(),
            callbacks=[
                CheckParametersConvergence(diff="relative", tolerance=1e-3),
                tracker,
            ],
        )
        t = time.time() - t0
    print(f"Time for {label} fit is {t:.3f}s.")

    fig = plt.figure()
    fig.suptitle(label)
    mu_ax = fig.add_subplot(221)
    std_ax = fig.add_subplot(222)
    hist_ax = fig.add_subplot(212)
    mu_ax.plot(tracker["mean"])
    mu_ax.set_title("Mean track")
    std_ax.plot(tracker["std"])
    std_ax.set_title("Std track")
    hist_ax.plot(advi.hist)
    hist_ax.set_title("Negative ELBO track")


if __name__ == "__main__":
    frame = generate_data(num_samples=1_000_000)
    fit_and_plot(make_model(frame), label="full-batch")
    fit_and_plot(make_model_minibatch(frame), label="mini-batch")
    plt.show()
```

For the small-sample variant, the performance comparison between the full-batch and the mini-batch variant is:

In this case the full-batch approach (top) beats the mini-batch variant (bottom).

For the full-batch variant the convergence plots for the parameters are

and the mini-batch variant is

For the mini-batch variant the parameters keep “fluctuating” around a constant value, while they look “constant” in the full-batch case. Intuitively this is what I would have expected: without a learning rate decay, some “limiting variance” remains in the parameters (similar to typical SGD behaviour).
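To illustrate what I mean by “limiting variance”, here is a minimal toy sketch. It uses plain SGD on a one-dimensional quadratic with artificial gradient noise, not ADVI with Adam, so it only mimics the qualitative behaviour; the function name `sgd_trace` and all step sizes are made up for this illustration.

```python
import numpy as np


def sgd_trace(step_size_fn, num_steps=20_000, noise_scale=1.0, seed=0):
    """Minimise 0.5 * theta**2 with noisy gradients; return the theta trajectory."""
    rng = np.random.default_rng(seed)
    theta = 5.0
    trace = np.empty(num_steps)
    for t in range(num_steps):
        grad = theta + noise_scale * rng.normal()  # exact gradient plus noise
        theta -= step_size_fn(t) * grad
        trace[t] = theta
    return trace


constant = sgd_trace(lambda t: 0.05)                     # fixed step size
decaying = sgd_trace(lambda t: 0.05 / (1.0 + 0.01 * t))  # decaying step size

# Spread of the iterates over the last 2000 steps of each run
tail_std_constant = constant[-2000:].std()
tail_std_decaying = decaying[-2000:].std()
print(f"constant step: {tail_std_constant:.4f}, decaying step: {tail_std_decaying:.4f}")
```

With the fixed step size the iterates keep jittering around the optimum with a spread that does not shrink, whereas the decaying schedule shrinks the spread over time.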

It is however unclear to me how to specify a meaningful value for the tolerance in the `CheckParametersConvergence` callback when such a “limiting variance” exists. For other (more complex) models I have observed more pronounced fluctuations while the ELBO values “seemed to have converged” already.
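One idea I have been toying with is to make the check relative to the observed noise instead of fixing an absolute or relative tolerance up front: compare the means of the last two windows of the tracked parameters and call it a plateau once their difference is small compared to the spread inside the windows. The sketch below is only a heuristic under that assumption; `has_plateaued`, `window` and `z_tol` are hypothetical names and values, not part of PyMC.

```python
import numpy as np


def has_plateaued(track, window=500, z_tol=0.5):
    """Heuristic: True if the last two windows of `track` have similar means.

    `track` has shape (iterations,) or (iterations, n_params), for example
    np.asarray(tracker["mean"]). The drift between the two window means is
    compared to the within-window standard deviation of each parameter.
    """
    track = np.asarray(track, dtype=float)
    if track.ndim == 1:
        track = track[:, None]
    if track.shape[0] < 2 * window:
        return False  # not enough history yet
    prev = track[-2 * window:-window]
    last = track[-window:]
    drift = np.abs(last.mean(axis=0) - prev.mean(axis=0))
    noise = np.maximum(last.std(axis=0), prev.std(axis=0)) + 1e-12
    return bool(np.all(drift < z_tol * noise))


# Synthetic check: a steady trend is not a plateau, stationary noise is.
rng = np.random.default_rng(0)
trend = np.linspace(0.0, 1.0, 1000)
stationary = 3.0 + rng.normal(size=1000)
print(has_plateaued(trend), has_plateaued(stationary))
```

The appeal is that `z_tol` is dimensionless, so it would not need retuning when the scale of the fluctuations changes. The window length still has to be long compared to the correlation time of the fluctuations, which I do not know a priori either.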

If I move to the large-data variant, the mini-batch variant (top) indeed becomes vastly more efficient:

Parameters convergence for full-batch:

Parameters convergence for mini-batch:

I can easily envision a situation in which I cannot specify a meaningful tolerance a priori (because the “limiting variance” of the parameter fluctuations is unknown in advance). In such a situation I will likely not be able to “stop the iteration early”, and the full-batch and mini-batch approaches will both be equally inefficient.

So it seems to me that a meaningful and robust convergence check for the optimisation is necessary in order to benefit from the potential of the mini-batch variant, especially for more complex models. But at the same time, for more complex models there is no reasonable a priori choice for the stopping tolerance when looking at parameter convergence.

Should I be stopping based on ELBO instead?
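For concreteness, this is roughly what I have in mind: a small custom callback that stops once the windowed median of the loss (negative ELBO) stops changing. It is a sketch, not a tested recipe. I am assuming that callbacks are called as `callback(approx, losses, i)` and that raising `StopIteration` ends the fit, as `CheckParametersConvergence` appears to do. The class name `CheckLossPlateau` and its default values are my own invention.

```python
import numpy as np


class CheckLossPlateau:
    """Hypothetical callback: stop when the windowed median loss stops changing."""

    def __init__(self, window=1000, every=100, rel_tol=1e-3):
        self.window = window    # number of iterations per comparison window
        self.every = every      # only check every `every` iterations
        self.rel_tol = rel_tol  # relative change below which we stop

    def __call__(self, approx, losses, i):
        # Assumed signature (approx, losses, i); `approx` is not used here.
        if i % self.every or len(losses) < 2 * self.window:
            return
        recent = np.asarray(losses[-2 * self.window:], dtype=float)
        prev_level = np.median(recent[: self.window])
        last_level = np.median(recent[self.window:])
        rel_change = abs(last_level - prev_level) / max(abs(prev_level), 1e-12)
        if rel_change < self.rel_tol:
            raise StopIteration(f"Loss plateau reached at iteration {i}")


# Synthetic check on a loss curve that decays towards a constant level.
fake_losses = 1000.0 + 500.0 * np.exp(-np.arange(5000) / 200.0)
callback = CheckLossPlateau(window=100, every=10, rel_tol=1e-3)
stopped_at = None
for i in range(len(fake_losses)):
    try:
        callback(None, fake_losses[: i + 1], i)
    except StopIteration:
        stopped_at = i
        break
print(f"stopped at iteration {stopped_at}")
```

In the script above it would go into the list as `callbacks=[CheckLossPlateau(), tracker]`. The median over a window is meant to damp the mini-batch noise in the ELBO, but `rel_tol` is still a tolerance I would have to pick, so I am not sure this actually solves my problem.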

In applications I will potentially fit the same model to varying (but similar, e.g. growing as more groups become available) datasets, and I cannot afford to “fine-tune” the tolerance for every new fit in order to achieve a meaningful stopping criterion for each model anew.

I’d be glad to get any insights into how this is typically handled in application/production settings.