Variational fit (ADVI) - initialisation

I would like to understand if (and how) I can specify the initial “guess”/parameters when computing posterior approximations via ADVI.

Consider the example of a simple linear regression below,

import time

import numpy as np
import pandas as pd
import pymc as pm
from pymc.variational.callbacks import CheckParametersConvergence, Tracker


def generate_data(num_samples: int) -> pd.DataFrame:
    rng = np.random.default_rng(seed=42)
    beta = 1.0    # true slope
    sigma = 10.0  # true noise scale
    x = rng.normal(loc=0.0, scale=1.0, size=num_samples)
    y = beta * x + sigma * rng.normal(size=num_samples)
    return pd.DataFrame({"x": x, "y": y})


def make_model(frame: pd.DataFrame) -> pm.Model:
    with pm.Model() as model:
        # Data
        x = pm.Data("x", frame["x"])
        y = pm.Data("y", frame["y"])

        # Prior
        beta = pm.Normal("beta", sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=20.0)

        # Linear model
        mu = beta * x

        # Likelihood
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    return model

if __name__ == "__main__":
    num_samples = 10_000
    frame = generate_data(num_samples=num_samples)

    model = make_model(frame)
    with model:
        advi = pm.ADVI()
        tracker = Tracker(
            mean=advi.approx.mean.eval,
            std=advi.approx.std.eval
        )
        t0 = time.time()
        approx = pm.fit(
            n=1_000_000,
            method=advi,
            callbacks=[
                CheckParametersConvergence(diff="relative", tolerance=1e-3),
                CheckParametersConvergence(diff="absolute", tolerance=1e-3),
                tracker
            ]
        )
        t = time.time() - t0
        print(f"Time for fit is {t:.3f}s.")

The method pm.fit has two optional parameters, start and start_sigma, both of which expect a dictionary that initialises the fit. I would like to understand how to set those dictionaries for the model above, i.e., what the keys of the two dictionaries are.

In the end I would like to implement a kind of “updating” scheme, e.g. fit the model “until convergence” for a number of data points, then get new samples and “re-fit” using the final parameters from the previous fit as the initial guess for the next fit.

To facilitate this I need to understand
i) how to specify start and start_sigma, and
ii) how to extract the final/converged parameters from a fitted approximation (so that I can use them as initial values for the next fit).

I’d be really glad for any help, pointers, or suggestions. I haven’t yet found a good explanation of the intended usage of start and start_sigma in the documentation. Using the Tracker instance is currently my best guess for obtaining the final parameters after fitting an approximation.
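
Concretely, my current best guess for reading off the fitted parameters looks roughly like this (untested sketch; both the Tracker history and the approximation itself expose the same flat vectors over the unconstrained parameters):

# Flat vectors over all unconstrained (transformed) parameters,
# in the approximation's internal ordering:
final_mean = approx.mean.eval()
final_std = approx.std.eval()

# The Tracker callback records the same quantities at every iteration,
# so the last entries should match the values above:
print(final_mean, tracker["mean"][-1])
print(final_std, tracker["std"][-1])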

Any help here is very much appreciated. Thanks!

I started looking into the implementation of ADVI in pymc/variational/inference.py, but I am having a hard time introspecting the code to understand how to correctly set start and start_sigma - in fact, I don’t see how those extra arguments are forwarded to the actual ADVI fit implementation.

@ricardoV94 @schmaus would you be able to give me some pointers here?

Hey Benjamin, start and start_sigma are arguments to pm.fit.

@fonnesbeck thanks a lot for your reply. Yes, I am aware that start and start_sigma are arguments to pm.fit.

Unfortunately, their description in the docs of pm.fit was not sufficient for me to understand how they need to be specified; a specific example, e.g. for the model outlined above, would be very helpful. I understand that both are dictionaries of arrays, but it is not clear to me how to find the correct keys and values (or at least the shapes of the arrays used as values).

In my “desperation” I started backtracking how those arguments are used in pm.fit. From what I see in the code, the call chain is roughly
i) start and start_sigma are added to the dictionary inf_kwargs
ii) ADVI(model=model, **inf_kwargs)
iii) KL(MeanField(model=model, **inf_kwargs))
iv) groups = [MeanFieldGroup(None, *args, model=model, **inf_kwargs)]
v) Group(groups, model=kwargs.get("model"))
I have replaced some super calls with the explicit types to clarify the nested calls. From that call stack I wasn’t able to infer how to correctly define start and start_sigma, and I was hoping that users/developers of PyMC with more familiarity with this part of the code base, or with the initialisation of ADVI in general, would be able to shed some light on this.

As you are discovering, the ADVI codebase is in need of a loving refactor.

Everything ultimately gets set up deep in the bowels of the MeanFieldGroup class, here. To understand what you need to give, it is good to look at the _prepare_start function.

For optimization problems, PyMC has a class called DictToArrayBijection that lets you freely flip-flop between a dictionary with parameter-name keys and n-dimensional array values, and a single flat array formed by concatenating all the raveled parameter arrays. The latter is what is expected by e.g. scipy.optimize.minimize.
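
A minimal round-trip looks roughly like this (module path and call signatures are from memory, so treat them as assumptions to verify):

from pymc.blocking import DictToArrayBijection

# model is the linear-regression model from the first post
point = model.initial_point()                     # dict: value-variable name -> ndarray
raveled = DictToArrayBijection.map(point)         # flat 1-D array plus point-map metadata
point_again = DictToArrayBijection.rmap(raveled)  # back to the dictionary form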

Edit: I brought up DictToArrayBijection because there is a comment in the linked method that mentions it, but staring at the code, it doesn’t appear to be required. I think standard dictionaries will be sufficient.

So what it appears you need to provide is:

  • For the mean, a dictionary of parameter_name: ndarray. These will be piped into make_initial_point_fn, which in turn will handle this bijection business for you (I think!)
  • For the sigmas, based on the start_sigmas.get here, it looks like you also want a dictionary of parameter_name: ndarray.

One note: the names you will need to provide are not necessarily the ones you expect. They will be the names you see in model.continuous_value_vars, which attaches a suffix based on the bijective transformation used internally to let samplers/optimizers propose values in \mathbb{R}^N rather than in constrained subspaces. For example, if you make a variable sigma = pm.HalfNormal('sigma'), the name you will find is sigma_log__.
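
So for the model in the first post, I would expect something along the following lines to work (untested sketch; in particular, whether start also accepts the untransformed names is something I would double-check):

with model:
    approx = pm.fit(
        n=50_000,
        method="advi",
        # means on the unconstrained scale, keyed by value-variable names
        start={"beta": np.array(0.0), "sigma_log__": np.array(np.log(10.0))},
        # standard deviations of the mean-field factors, same keys
        start_sigma={"beta": np.array(0.5), "sigma_log__": np.array(0.5)},
    )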


Thanks a lot @jessegrabowski for your reply, for looking into the code, and for coming up with all those valuable pointers and explanations - very much appreciated!

I will look into this and try to reach out in case I have follow-up questions. Looking forward to figuring this out.

If you call initial_point on any model, it will give you the default initial values that you can use to get the required keys and shapes for the starting values, which is helpful for initializing more complex models.
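
For the model above, that would look something like this:

# Prints a dict keyed by the value-variable names ('beta' and 'sigma_log__'
# for this model), with arrays of the shapes expected for the starting values.
print(model.initial_point())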


@jessegrabowski and @fonnesbeck: I looked into the code and I believe there are some subtle bugs (not in the sampler itself), cf. the issue here.

I’d be glad if you could take a look, and if there are any comments or questions I’d be delighted to discuss them.


The problem is that the ADVI algorithm doesn’t work very well as originally written. See:

The problem with ADVI is that it could really use (a) many more evaluations of log density and gradient for evaluation of ELBO (thousands are ideal in order to stabilize stochastic gradient), (b) the stick-the-landing reparameterization gradients unless you’re going to do even more evaluations of log density and gradient for the ELBO, and (c) better step size adaptation.

There’s also the not insubstantial problem of only having a multivariate normal approximation on the unconstrained scale or even a completely factored multivariate normal with diagonal covariance—this limits the kinds of distributions that can be modeled well (e.g., all positive-constrained parameters get a lognormal marginal posterior and all unconstrained parameters get a normal marginal posterior in both dense and diagonal ADVI versions).

@bob-carpenter thanks for highlighting those problems. I am also looking into using a deterministic variant of ADVI, cf. ENH: implement deterministic ADVI (DADVI) · Issue #7374 · pymc-devs/pymc · GitHub, to hopefully mitigate some of the problems of stochastic gradient descent.

DADVI still uses a mean-field approach, and until I have gotten around to implementing it I will need to use the methods that are currently available.

I am looking at a sequence of hierarchical models with many groups (thousands) and hundreds to thousands of samples per group - at the moment HMC sampling is too slow, so I need to use VI to speed up inference.

In addition, models in the sequence feature an increasing number of groups, and unfortunately there doesn’t seem to be a computationally feasible way to use posterior samples from a previous model to speed up HMC inference for the next model in the chain. With VI I can use the fitted approximation of the previous model as the initial guess for the next model, e.g. by transferring the parameters of the mean-field approximation.
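
To make it concrete, the kind of update loop I have in mind looks roughly like the sketch below (using the toy model from the first post as a stand-in; data_batches is a hypothetical sequence of growing datasets, and I am assuming that the flat mean/std vectors of the mean-field approximation follow the same variable ordering as model.initial_point(), which I still need to verify against the source):

def approx_params_as_dicts(model, approx):
    # Split the flat mean/std vectors of the fitted mean-field approximation
    # into per-variable dictionaries keyed by the value-variable names.
    # CAVEAT: assumes the flat vectors follow the ordering of model.initial_point().
    point = model.initial_point()
    sizes = [np.asarray(val).size for val in point.values()]
    split_at = np.cumsum(sizes)[:-1]
    means = np.split(approx.mean.eval(), split_at)
    stds = np.split(approx.std.eval(), split_at)
    start = {name: m.reshape(np.asarray(val).shape)
             for (name, val), m in zip(point.items(), means)}
    start_sigma = {name: s.reshape(np.asarray(val).shape)
                   for (name, val), s in zip(point.items(), stds)}
    return start, start_sigma


start, start_sigma = None, None
for batch in data_batches:
    model = make_model(batch)
    with model:
        approx = pm.fit(
            n=100_000,
            method="advi",
            start=start,
            start_sigma=start_sigma,
            callbacks=[CheckParametersConvergence(diff="relative", tolerance=1e-3)],
        )
    # Carry the fitted parameters over as the initial guess for the next fit.
    start, start_sigma = approx_params_as_dicts(model, approx)

Does that make sense?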

PyMC presumably lets you warm-start HMC sampling with (a) the step size, (b) the mass matrix, and (c) a draw from a previous fit, but you have to be careful that the parameters mean the same thing. As soon as you add more covariates, the interpretation of existing coefficients changes. Just be careful not to use posterior means to initialize, as those are often bottlenecks for sampling if you start there.

The problem with deterministic ADVI is that unless your fixed sample of draws is very large, you’re going to get a ton of bias. Now that may be OK, since mean-field ADVI is giving you a ton of bias already, especially in estimating uncertainty. Or you may be trying to do what Ryan Giordano et al. did in their paper on deterministic ADVI and try to post-correct using linear adjustments. See:

I would strongly recommend testing that you are getting appropriate answers with ADVI using something like posterior predictive checks.


@bob-carpenter the post-correction capability of DADVI is one aspect that I really like and would want to apply. The paper also mentions the possibility of checking whether the current sample size for the SAA approximation is sufficient, so that a successful check can be used to stop the (outer) iteration over the sample size. But I have no practical experience yet and will have to see how it performs and how useful the computed approximations are.

Thanks a lot for highlighting the need to run additional checks. I have already seen that the mean-field approximation has significant bias for the across-group variation of e.g. the slopes in a hierarchical linear model (when running on synthetic data from a generative model). But at least the population slopes were captured reasonably well. It’s definitely not ideal, and I am already wary of how it will perform when I move from a linear to a nonlinear setting.