Confusion about MCMC and INLA

Hi,

I’m a bit confused about when I can use the Laplace approximation instead of MCMC for the posterior distribution. I understand that the Laplace approximation requires the assumption that the posterior distribution is (approximately) normal, whereas MCMC does not. For a problem where the likelihood is normal and the prior is uniform, must the posterior be normally distributed (or approximately so)? I’m rather confused about in which cases it would be skewed.

best regards

CC @theo

Hey, so there’s a couple of things at play here. Your question has INLA in the title which is a bit different to the vanilla Laplace approximation. I’ll answer the question based on the Laplace approximation and then I’ll explain the differences with INLA at the end.

Laplace approximation

The Laplace approximation approximates the posterior density by fitting a multivariate Gaussian (normal) over all the parameters, with mean vector equal to the MAP estimate (the posterior mode, found by minimising the negative log-posterior density) and precision matrix (inverse covariance) equal to the negative Hessian of the log posterior with respect to the parameters, evaluated at the mode.

Don’t worry too much about the details, but since you are fitting a Gaussian, the approximation becomes exact when the posterior is Gaussian, and it is usually good unless the posterior is multimodal (many MCMC methods also struggle with this) or highly skewed.
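If it helps to see the recipe concretely, here is a minimal sketch on a toy one-dimensional model (my own made-up example, nothing PyMC-specific): find the mode, measure the curvature there, and read off a Gaussian. The model is conjugate on purpose so the fit can be checked against the exact posterior.

```python
# A minimal sketch of the recipe above on a toy 1D model: Poisson counts with
# an Exponential(1) prior on the rate. The model, variable names, and the
# finite-difference Hessian are illustrative choices, not anything from PyMC.
import numpy as np
from scipy import optimize, stats

y = np.array([3, 5, 2, 4, 6])  # toy observed counts

def neg_log_post(lam):
    lam = lam[0]
    if lam <= 0:
        return np.inf
    log_lik = np.sum(stats.poisson.logpmf(y, lam))
    log_prior = stats.expon.logpdf(lam)  # Exponential(1) prior on the rate
    return -(log_lik + log_prior)

# 1. MAP estimate: minimise the negative log-posterior density.
res = optimize.minimize(neg_log_post, x0=[1.0], method="Nelder-Mead")
mode = res.x[0]

# 2. Precision = second derivative of the negative log posterior at the mode
#    (central finite differences; for a multivariate model this is the Hessian).
eps = 1e-3
precision = (neg_log_post([mode + eps]) - 2 * neg_log_post([mode])
             + neg_log_post([mode - eps])) / eps**2

# 3. Laplace approximation: Normal(mode, 1 / precision).
laplace = stats.norm(loc=mode, scale=np.sqrt(1 / precision))

# This toy model is conjugate, so we can compare against the exact posterior,
# which is Gamma(1 + sum(y)) with rate n + 1.
exact = stats.gamma(a=1 + y.sum(), scale=1 / (len(y) + 1))
print(f"mode {mode:.3f} | Laplace sd {laplace.std():.3f} | exact sd {exact.std():.3f}")
```

Here the Gamma posterior is only mildly skewed, so the Laplace standard deviation lands close to the exact one; for a heavily skewed or multimodal posterior the mismatch would be much larger.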

Richard McElreath uses the Laplace approximation in the early parts of Statistical Rethinking, and I’m sure he explains it more clearly than I can. If you are worried about whether it is a good approximation for your model, and whether the speed gains are worth using an approximation, try fitting both MCMC and the Laplace approximation and comparing them, as in the sketch below. If the Laplace approximation produces a posterior similar to MCMC’s, then happy days.
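As a rough version of that sanity check, continuing the toy model from the previous snippet: the random-walk Metropolis sampler below is deliberately bare-bones for illustration, and for a real PyMC model you would just compare against the output of pm.sample().

```python
# Sanity check: compare the Laplace fit to samples from a crude random-walk
# Metropolis sampler on the same toy posterior. Purely illustrative; for a
# real model you would use pm.sample() for the MCMC side.
import numpy as np
from scipy import stats

y = np.array([3, 5, 2, 4, 6])

def log_post(lam):
    if lam <= 0:
        return -np.inf
    return np.sum(stats.poisson.logpmf(y, lam)) + stats.expon.logpdf(lam)

rng = np.random.default_rng(1)
lam, samples = 1.0, []
for _ in range(20_000):
    proposal = lam + rng.normal(scale=0.5)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(lam):
        lam = proposal
    samples.append(lam)
samples = np.array(samples[5_000:])  # drop burn-in

# Laplace fit from the previous snippet (hard-coded so this block runs alone).
laplace = stats.norm(loc=3.333, scale=0.745)
print("Laplace 5% / 95%:", laplace.ppf([0.05, 0.95]))
print("MCMC    5% / 95%:", np.quantile(samples, [0.05, 0.95]))
```

If the two intervals (and the posterior means) roughly agree, the Laplace approximation is probably fine for your purposes; if they diverge badly, stick with MCMC.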

Integrated nested Laplace approximation (INLA)

With INLA, we approximate the marginal posterior distribution of some subset of the parameters (the latent field, u below) using a Laplace approximation; this is referred to as the marginal Laplace approximation.

The remaining parameters (θ below) are then integrated out using another method.

  • Integrated. The remaining parameters are integrated out using numerical integration.
  • Nested. Because we need p(θ∣y) to get p(u∣y).
  • Laplace approximations. The method used to obtain the parameters (mean and precision) of the Gaussian approximation.

So the difference is that we only use the Laplace approximation on some of the parameters, which we hope have a near-Gaussian (conditional) posterior, and use some other, more flexible inference method (numerical integration in the case of R-INLA) for the remaining parameters.
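Schematically (a sketch of the standard INLA recipe rather than R-INLA’s exact implementation, using u for the Laplace-approximated subset and θ for the remaining parameters as in the bullets above):

$$
\tilde p(\theta \mid y) \propto \left.\frac{p(y \mid u, \theta)\, p(u \mid \theta)\, p(\theta)}{\tilde p_G(u \mid \theta, y)}\right|_{u = u^*(\theta)},
\qquad
\tilde p(u_i \mid y) \approx \sum_k \tilde p(u_i \mid \theta_k, y)\, \tilde p(\theta_k \mid y)\, \Delta_k,
$$

where $\tilde p_G(u \mid \theta, y)$ is the Gaussian (Laplace) approximation to the full conditional of the latent field, $u^*(\theta)$ is its mode, and the sum runs over a grid of θ values with integration weights $\Delta_k$; this is exactly the “integrated” and “nested” structure in the bullets.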

As the R-INLA docs note, INLA requires the “full conditional density for the latent field to be near Gaussian”, i.e. the posterior density for the subset of the parameters we approximate with the Laplace approximation should be close to Gaussian.

PyMC work on INLA is currently paused (we need a pytensor optimiser and I need to sort my life out), but we are very, very open to contributions if people are interested in having this in PyMC. It’s not that far off, and the issue tracker is here.

I have some notes on INLA with further reading at the bottom; I also recommend Adam Howes’ thesis chapter on this, which builds up these ideas from scratch and compares them to MCMC.

2 Likes

The Laplace approximation is just a second-order Taylor expansion of the log posterior around its mode. You approximate the distribution with a multivariate normal distribution centered at the posterior mode, with covariance given by the negative inverse Hessian of the log posterior at the mode. The first-order (gradient) term drops out because the gradient is zero at a mode.
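In symbols (just a sketch, writing $\theta^*$ for the posterior mode and $H$ for the negative Hessian of the log posterior evaluated there):

$$
\log p(\theta \mid y) \approx \log p(\theta^* \mid y) - \tfrac{1}{2}(\theta - \theta^*)^\top H (\theta - \theta^*),
\qquad
H = -\nabla^2 \log p(\theta \mid y)\big|_{\theta = \theta^*},
$$

so the approximating distribution is $\theta \mid y \approx \mathcal{N}(\theta^*, H^{-1})$; the first-order term vanishes because the gradient is zero at the mode.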

Many Bayesian models, particularly hierarchical models, do not have posterior modes; for example, the joint density of a centered hierarchical normal model grows without bound as the group-level scale shrinks toward zero. In that case, you can’t use the Laplace approximation for the whole distribution because there’s no mode to expand around. But in these situations we can often use the integrated nested Laplace approximation (aka INLA).

INLA is used to marginalize a density p(\alpha, \beta) to p(\alpha) = \int p(\alpha, \beta) \, \text{d}\beta. Typically, this is used for hierarchical models where \alpha are the high-level population parameters (e.g., population mean and variance) with a known prior p(\alpha), whereas the \beta are low-level coefficients such as regression coefficients or random effects, so \alpha is low-dimensional and \beta is high-dimensional. Then we factor p(\alpha, \beta) = p(\alpha) \cdot p(\beta \mid \alpha), and the nested Laplace approximation is of p(\beta \mid \alpha). Even though the distribution of p(\beta \mid \alpha) is usually simple, everything’s conditioned on the observed data y and we’re looking at doing all of this in the posterior p(\alpha, \beta \mid y). All the detail’s on the Wikipedia page (they use \theta, x where I used \alpha, \beta): Integrated nested Laplace approximations - Wikipedia
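If it helps to see that recipe numerically, here is a toy sketch in Python (all names are my own, not from R-INLA or any library): a one-dimensional Gaussian \beta with a single hyperparameter \alpha, where we Laplace-approximate p(\beta \mid \alpha, y) on a grid of \alpha values and integrate numerically. Because everything in this toy is Gaussian, the Laplace step is exact and the result can be checked against the closed-form marginal.

```python
# Toy nested-Laplace marginalisation: recover p(alpha | y) for a tiny Gaussian
# hierarchy by Laplace-approximating p(beta | alpha, y) over a grid of alpha.
# Everything here is illustrative and chosen so the answer has a closed form.
import numpy as np
from scipy import integrate, optimize, stats

rng = np.random.default_rng(0)
sigma, n = 0.5, 10
y = rng.normal(1.2, sigma, size=n)  # observed data from a "true" beta of 1.2

def log_joint(alpha, beta):
    """log p(alpha, beta, y) = log p(alpha) + log p(beta | alpha) + log p(y | beta)."""
    return (stats.norm.logpdf(alpha, 0.0, 2.0)           # prior on alpha
            + stats.norm.logpdf(beta, alpha, 1.0)         # beta | alpha
            + np.sum(stats.norm.logpdf(y, beta, sigma)))  # y | beta

alphas = np.linspace(-3.0, 4.0, 200)  # integration grid for alpha
log_marg = np.empty_like(alphas)
eps = 1e-4
for i, a in enumerate(alphas):
    # Laplace-approximate p(beta | alpha, y): mode and curvature in beta.
    b_star = optimize.minimize_scalar(lambda b: -log_joint(a, b)).x
    hess = -(log_joint(a, b_star + eps) - 2 * log_joint(a, b_star)
             + log_joint(a, b_star - eps)) / eps**2
    # p(alpha | y) is proportional to p(alpha, beta*, y) / p_G(beta* | alpha, y)
    log_marg[i] = log_joint(a, b_star) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(hess)

post = np.exp(log_marg - log_marg.max())
post /= integrate.trapezoid(post, alphas)  # normalise on the grid

# Closed-form check: in this toy model alpha | y is exactly Normal.
prec = 1 / 4 + 1 / (1 + sigma**2 / n)
exact_mean = (y.mean() / (1 + sigma**2 / n)) / prec
print("nested-Laplace mean:", integrate.trapezoid(alphas * post, alphas))
print("exact mean / sd:    ", exact_mean, np.sqrt(1 / prec))
```

With something non-Gaussian in the likelihood (say, Poisson counts), the inner Laplace step is no longer exact, but the structure of the computation stays the same.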

3 Likes