Laplace approximation vs VI

Hi,

I’m looking at how VI and Laplace approximation work and they seem very similar to me.

The fit_laplace function in pymc-extras approximates the parameter space with a multivariate normal, matches the normal means to the posterior MAP and computes the variance from the inverse Hessian, thereby conserving correlations.

VI also approximates the parameters with a multivariate normal and fits it by minimizing the Kullback–Leibler divergence. It minimizes KL by minimizing the ELBO using gradients methods which will also converge to local modes.

So both methods work well when the true posterior is close to the Gaussian approximation and break down when this is not the case. Uncertainty will be biased by both methods to a certain degree.

Could you provide me with additional insight about how VI methods are superior?

Thanks for your help!

I don’t know about superior.

VI can fit different distribution families not just gaussian, but can also ignore correlation (as mean field does). They have a specific loss function (KL divergence or in practice ELBO), whereas Laplace finds the posterior mode and the hessian at that point to fit a mvnormal. It has no concept of loss/goodness of fit beyond that.

It’s known that MAP can’t represent well the posterior of a hierarchical model. In general, as the dimensionality of the posterior goes up, the mode becomes a less and less representative point, so MAP + Laplace becomes a worse and worse approximation. My understanding is these are the main issues that VI seeks to address.

Thank you very much both for your explanation, it makes more sense to me now. I think especially the challenge of hierarchical models in high dimensions makes sense!

[Edit: This isn’t quite what’s going on in all these systems like NONMEM, TMB, and lme4. See clarifying posts below.]

This can be almost arbitrarily bad. What works better for a hierarchical model is a max marginal “Laplace” approximation. That is, marginalize out the low-level parameters, optimize the high-level parameters, plug those back in to optimize the low level parameters, then lay down a second-order Taylor series approximation at that point (it’s not quite a Laplace approximation as it’s not the mode and technically the first order term will not drop out, but we’ll pretend it does here).

That is, if we have posterior p(\alpha, \beta), derive the marginal

\qquad p(\alpha) = \int_B p(\alpha, \beta) \textrm{d}\beta,

where B is the range of \beta. Then optimize that to estimate \alpha,

\qquad \alpha^* = \textrm{arg max}_\alpha \ p(\alpha).

Now plug that point estimate back in to get

\qquad p(\beta \mid \alpha^*) \propto p(\alpha^*, \beta),

and optimize again to get

\qquad \beta^* = \textrm{argmax}_\beta \ p(\beta \mid \alpha^*).

Now if you take the negative Hessian at (\alpha^*, \beta^*), that gives you a precision matrix

\qquad \Sigma^{-1} = -H(\alpha, \beta) = -\nabla_{\alpha, \beta} \nabla_{\alpha, \beta} \log p(\alpha, \beta),

that you can plug into a multivariate normal, e.g.,

\qquad (\alpha, \beta) \sim \textrm{multiNormal}((\alpha^*, \beta^*), \Sigma).

The really cool thing here is that this Hessian is often sparse, so that you can evaluate the approximation density and sample from the Laplace approximation very efficiently with only sparse-precision/vector products. If you try to invert this to get the inverse mass matrix you’d need for Hamiltonian Monte Carlo, it becomes dense again. So instead you use it as a preconditioner on the density, where it never needs to leave sparse form.

If you just take the approximation and Hessian approximation, you are basically computing the same point estimate and uncertainty as lme4 or TMB do in R.

Here’s a paper showing how to do all this and use it as an efficient full-rank preconditioner for Hamiltonian Monte Carlo sampling. Sorry for the self-plug, but this is really all Cole’s work.

1 Like

Is there a connection to INLA in that marginalize → optimize presentation? Or are they conceptually totally different?

I find INLA really confusing. They’re doing an outer quadrature (the I part for iteration) and an inner Laplace approximation (the NLA part) to approximate the marginals to target Bayesian marginal posteriors of the what I called \alpha and \beta. I don’t think I could spell out the algorithm reliably.

Charles Margossian and Steve Bronder just coded up a version of nested Laplace approximation for Stan’s latest release using the black-box Monte Carlo approach Charles developed here:

Here’s the Stan documentation:

Charles is going to add examples to the User’s Guide ASAP, but I don’t know when they’ll be available.

I wish I understood the relationship among all these tools better!

1 Like

That’s interesting because it somewhat reminds me of what NONMEM is doing. NONMEM is a specialized software to fit point estimates of hierarchical models for drug development.

My understanding is that it is using a loop where it starts from given high-level parameters and finds the mode of the low-level parameters. The first-order term drops out because the low-level parameter is conditional on the high-level parameters, it does not operate of the joint posterior. Then you integrate the gaussian analytically and move a step along the gradient of the high-level parameters.

I looked at how PyMC-extras does Laplace because I figured it would be different from NONMEM. Now you introduce a third method.

@ricardoV94 I wonder if we could do something like this as a demonstration/use-case of Normal-Normal marginalization after Allow closed form marginalization and implement `conditional` model transform by ricardoV94 · Pull Request #688 · pymc-devs/pymc-extras · GitHub

That functionality does exact marginalization not an approximation (?) like INLA, but INLA is indeed sharing the marginalize machinery (find a blanket markov, wrap the dependents and provide the logp approximation to it)

API is something like marginalize(model, var, method="INLA", Q=?). INLA calls that under the hood.

I don’t recall if our implementation is restricted to marginalizing gaussians actually. Maybe it is and it does give a proper marginalization. Perhaps @theo can tell

You’re right. I didn’t realize there was a loop in NONMEM, but it indeed works like the EM algorithm but with a MAP estimate rather than an intractable expectation.

Something like lme4 also operates iteratively. TMB, on the other hand, does the one-shot marginalization that I thought all of these systems did.

Sorry my earlier marks were wrong!