I am looking into ways to find a “sensible” value for the number of iterations required to approximate a posterior using variational inference (ADVI). A closely related topic is Justification for ADVI convergence criterion?
I used the Introduction to Variational Inference with PyMC notebook from the PyMC example gallery to guide me while learning about and trying out this approach.
I am (deliberately) using a simple linear regression model to understand the convergence behaviour of the ADVI optimisation.
I consider two model variants, one using a full-batch and the other a mini-batch approach.
I am comparing the “performance”, i.e. wall-time until convergence, for both approaches.
This is done for two variants of the problem, one with “little” data (n=1000 samples) and one with lots of data (n=1_000_000 samples).
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
from pymc.variational.callbacks import CheckParametersConvergence, Tracker
def generate_data(num_samples: int) -> pd.DataFrame:
    rng = np.random.default_rng(seed=42)
    beta = 1.0
    sigma = 10.0
    x = rng.normal(loc=0.0, scale=1.0, size=num_samples)
    y = beta * x + sigma * rng.normal(size=num_samples)
    return pd.DataFrame({"x": x, "y": y})
def make_model(frame: pd.DataFrame) -> pm.Model:
    with pm.Model() as model:
        # Data
        x = pm.Data("x", frame["x"])
        y = pm.Data("y", frame["y"])
        # Priors
        beta = pm.Normal("beta", sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=20.0)
        # Linear model
        mu = beta * x
        # Likelihood
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    return model
def make_model_minibatch(frame: pd.DataFrame) -> pm.Model:
    with pm.Model() as model:
        # Data
        x, y = pm.Minibatch(frame["x"], frame["y"], batch_size=10)
        # Priors
        beta = pm.Normal("beta", sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=20.0)
        # Linear model
        mu = beta * x
        # Likelihood
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y, total_size=len(frame))
    return model
if __name__ == "__main__":
    frame = generate_data(num_samples=1_000_000)

    # Full-batch variant
    model = make_model(frame)
    with model:
        advi = pm.ADVI()
        tracker = Tracker(
            mean=advi.approx.mean.eval,
            std=advi.approx.std.eval,
        )
        t0 = time.time()
        approx = pm.fit(
            n=1_000_000,
            method=advi,
            obj_optimizer=pm.adam(),
            callbacks=[
                CheckParametersConvergence(diff="relative", tolerance=1e-3),
                tracker,
            ],
        )
        t = time.time() - t0
        print(f"Time for fit is {t:.3f}s.")

    fig = plt.figure()
    mu_ax = fig.add_subplot(221)
    std_ax = fig.add_subplot(222)
    hist_ax = fig.add_subplot(212)
    mu_ax.plot(tracker["mean"])
    mu_ax.set_title("Mean track")
    std_ax.plot(tracker["std"])
    std_ax.set_title("Std track")
    hist_ax.plot(advi.hist)
    hist_ax.set_title("Negative ELBO track")

    # Mini-batch variant
    model = make_model_minibatch(frame)
    with model:
        advi = pm.ADVI()
        tracker = Tracker(
            mean=advi.approx.mean.eval,
            std=advi.approx.std.eval,
        )
        t0 = time.time()
        approx = pm.fit(
            n=1_000_000,
            method=advi,
            obj_optimizer=pm.adam(),
            callbacks=[
                CheckParametersConvergence(diff="relative", tolerance=1e-3),
                tracker,
            ],
        )
        t = time.time() - t0
        print(f"Time for fit is {t:.3f}s.")

    fig = plt.figure()
    mu_ax = fig.add_subplot(221)
    std_ax = fig.add_subplot(222)
    hist_ax = fig.add_subplot(212)
    mu_ax.plot(tracker["mean"])
    mu_ax.set_title("Mean track")
    std_ax.plot(tracker["std"])
    std_ax.set_title("Std track")
    hist_ax.plot(advi.hist)
    hist_ax.set_title("Negative ELBO track")

    plt.show()
For the small-sample variant the performance comparison between the full-batch and the mini-batch approach is:
In this case the full-batch approach (top) beats the mini-batch variant (bottom).
For the full-batch variant the convergence plots for the parameters are:
and for the mini-batch variant they are:
For the mini-batch variant the parameters keep “fluctuating” around a constant value, while they look essentially constant in the full-batch case. Intuitively this is what I would have expected: without a learning-rate decay there remains some “limiting variance” in the parameters (similar to typical SGD behaviour).
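One mitigation I have been considering is to decay the learning rate during the fit, so that this limiting variance shrinks over time and the parameter traces actually settle down. Below is a rough sketch of what I mean; it assumes that pm.adam accepts a shared variable as learning_rate (the optimisers in pymc.variational.updates build their updates symbolically, so this should work, but I have not verified it across versions), and the schedule, names and numbers are entirely placeholders of my own.

import numpy as np
import pytensor

# Shared scalar so that a callback can change the step size while pm.fit is running.
# Assumption: pm.adam(learning_rate=...) accepts a shared/symbolic scalar.
learning_rate = pytensor.shared(
    np.asarray(1e-2, dtype=pytensor.config.floatX), name="learning_rate"
)

def decay_learning_rate(approx, loss_hist, i, factor=0.5, every=10_000, floor=1e-5):
    # Halve the step size every `every` iterations, but never go below `floor`.
    if i > 0 and i % every == 0:
        new_value = max(float(learning_rate.get_value()) * factor, floor)
        learning_rate.set_value(np.asarray(new_value, dtype=pytensor.config.floatX))

# Usage sketch (inside the model context, with hypothetical values):
#   approx = pm.fit(
#       n=1_000_000,
#       method=advi,
#       obj_optimizer=pm.adam(learning_rate=learning_rate),
#       callbacks=[decay_learning_rate, CheckParametersConvergence(diff="relative", tolerance=1e-3), tracker],
#   )

Even with such a decay, though, deciding when to stop still comes down to the tolerance question below.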
It is, however, unclear to me how to specify a meaningful value for the tolerance in the CheckParametersConvergence callback when such a “limiting variance” exists. For other (more complex) models I have observed more pronounced fluctuations even though the ELBO values “seemed to have converged” already.
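One idea I have been toying with is to average the flattened parameter vector over a window of iterations before applying the relative-change check, so that the mini-batch fluctuations are smoothed out before comparing. The sketch below is my own construction and not a PyMC built-in; it relies on callbacks being called as callback(approx, loss_hist, i) and on approx.params holding the shared variables of the approximation (which is what CheckParametersConvergence reads internally); the class name, window and tolerance values are placeholders.

import numpy as np

class WindowedParametersConvergence:
    # Sketch only, not a PyMC built-in: compare the mean of the flattened
    # approximation parameters over consecutive windows and stop when the
    # relative change between those window means is small.

    def __init__(self, every=100, window=100, tolerance=1e-3):
        self.every = every
        self.window = window
        self.tolerance = tolerance
        self._buffer = []
        self._previous_mean = None

    def __call__(self, approx, loss_hist, i):
        # approx.params is a list of shared variables (e.g. mu and rho for mean-field ADVI).
        current = np.concatenate([p.get_value().ravel() for p in approx.params])
        self._buffer.append(current)
        if len(self._buffer) > self.window:
            self._buffer.pop(0)
        if i % self.every or len(self._buffer) < self.window:
            return
        window_mean = np.mean(self._buffer, axis=0)
        if self._previous_mean is not None:
            rel_change = np.abs(window_mean - self._previous_mean) / (
                np.abs(self._previous_mean) + 1e-10
            )
            if np.max(rel_change) < self.tolerance:
                raise StopIteration(f"Windowed parameters converged at iteration {i}.")
        self._previous_mean = window_mean

I would pass an instance of this in the callbacks list of pm.fit instead of (or next to) CheckParametersConvergence. It does dampen the noise, but the window size becomes yet another knob, so it does not really answer the question of how to make the criterion robust across models.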
If I move to the large-data variant, then the mini-batch approach (top) indeed becomes vastly more efficient:
Parameter convergence for the full-batch variant:
Parameter convergence for the mini-batch variant:
I can easily envision a situation in which I cannot specify a meaningful tolerance a priori (because the “limiting variance” of the parameter fluctuations is itself unknown a priori). In such a situation I will likely not be able to “stop the iteration early”, and the full-batch and mini-batch approaches will end up equally inefficient.
So it seems to me that a meaningful and robust convergence check for the optimisation is necessary in order to benefit from the potential of the mini-batch variant, especially for more complex models. At the same time, for more complex models there is no reasonable a-priori choice for the stopping tolerance when looking at parameter convergence.
Should I be stopping based on ELBO instead?
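If stopping on the ELBO is the better route, I imagine the check would look something like the sketch below (again my own construction, not a PyMC built-in): compare the mean negative ELBO over the last window of iterations with the mean over the window before that, using the loss history that pm.fit passes to callbacks (tracked by default for ADVI); the class name, window and tolerance are placeholders.

import numpy as np

class CheckLossConvergence:
    # Sketch only, not a PyMC built-in: stop when the windowed mean of the
    # negative ELBO stops changing (in relative terms) between consecutive windows.

    def __init__(self, every=100, window=1000, tolerance=1e-3):
        self.every = every
        self.window = window
        self.tolerance = tolerance

    def __call__(self, approx, loss_hist, i):
        if loss_hist is None or i % self.every or len(loss_hist) < 2 * self.window:
            return
        recent = np.mean(loss_hist[-self.window:])
        previous = np.mean(loss_hist[-2 * self.window:-self.window])
        rel_change = abs(recent - previous) / (abs(previous) + 1e-10)
        if rel_change < self.tolerance:
            raise StopIteration(f"Mean loss converged at iteration {i}.")

But this only seems to shift the problem: the windowed ELBO also fluctuates under mini-batching, so I would still have to pick a window and tolerance per model.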
In applications I will potentially fit the same model to varying (but similar, e.g. growing as new groups become available) datasets, and I cannot afford to “fine-tune” the tolerance for every new fit in order to obtain a meaningful stopping criterion each time.
I’d be glad to get any insights into how this is typically handled in application/production settings.