# Kernel hyper-parameters priors and their use in MAP estimates of Gaussian Processes

Hi,

When we call pm.find_MAP() on a marginal likelihood implementation of GPs (e.g. here), I’m assuming, as per Bayesian Data Analysis, 3rd ed., page 504, that the marginal likelihood is first maximized, then the (log) prior is added, and then the maximum of the posterior is found?

I’ve read, however, that the prior for marginal likelihood models just provides starting points for the marginal likelihood maximization, to get around its non-convex nature.

Could you tell me what’s going on when I call find_MAP on GPs?

Rob

I am not sure if I follow “marginal likelihood is first maximized, then the (log) prior is added, and then the maximum of the posterior is found”, but as far as I know, find_MAP() minimizes the negative joint log likelihood of your model.

To be clear, say you have a model like the one given below:


```python
with pm.Model() as model:
    ℓ = pm.Gamma("ℓ", alpha=2, beta=1)
    η = pm.HalfCauchy("η", beta=5)

    cov = η**2 * pm.gp.cov.Matern52(1, ℓ)
    gp = pm.gp.Marginal(cov_func=cov)

    f = gp.marginal_likelihood("f", X=X, y=y)

    map_estimates = pm.find_MAP()
```


This will build the joint log likelihood of all the continuous variables in your model and minimize its negative. In this example, the joint log likelihood \log P(\ell, \eta, y) = \log P(\ell) + \log P(\eta) + \log P(y \mid \ell, \eta) is the sum of the log likelihoods of pm.Gamma, pm.HalfCauchy and pm.MvNormal (the GP’s marginal likelihood).

Minimizing this negative log likelihood is, in general, a high dimensional non-convex optimization problem, but methods like L-BFGS-B work out pretty well on small to medium sized data.
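As a rough sketch of that objective (this is hand-rolled NumPy/SciPy with made-up toy data and a fixed noise level, not PyMC internals): the quantity being minimized is the negative log marginal likelihood plus the negative log priors, here handed to L-BFGS-B directly.

```python
# Hand-rolled sketch (assumed names/toy data, not PyMC internals) of the
# objective find_MAP minimizes for this model: negative log priors plus
# negative log marginal likelihood, optimized with L-BFGS-B.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma, halfcauchy

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
noise = 0.1  # observation noise, held fixed to keep the sketch short

def matern52(x, ell, eta):
    # Matern 5/2 kernel: eta^2 * (1 + s + s^2/3) * exp(-s), s = sqrt(5)*r/ell
    r = np.abs(x[:, None] - x[None, :])
    s = np.sqrt(5.0) * r / ell
    return eta**2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def neg_log_joint(params):
    log_ell, log_eta = params  # optimize in log space so ell, eta stay positive
    ell, eta = np.exp(log_ell), np.exp(log_eta)
    K = matern52(X, ell, eta) + noise**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log marginal likelihood of a zero-mean multivariate normal
    lml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(X) * np.log(2.0 * np.pi)
    # log priors matching the model above: Gamma(2, 1) on ell, HalfCauchy(5) on eta
    log_prior = gamma.logpdf(ell, a=2, scale=1.0) + halfcauchy.logpdf(eta, scale=5.0)
    # Jacobian terms of the log transform (PyMC likewise optimizes transformed variables)
    return -(lml + log_prior + log_ell + log_eta)

res = minimize(neg_log_joint, x0=[0.0, 0.0], method="L-BFGS-B")
ell_map, eta_map = np.exp(res.x)
```

Note the priors show up only as additive log-density terms in the objective; they are not just starting values.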

I’ve read however that the prior for marginal likelihood models just provides starting points for the marginal likelihood maximization to get around its non-convex nature.

You can do this in PyMC3 either by providing a testval to the distribution or a start parameter to the find_MAP() function. Here’s an example:


```python
with pm.Model() as model:
    ℓ = pm.Gamma("ℓ", alpha=2, beta=1, testval=0.7)
    η = pm.HalfCauchy("η", beta=5, testval=10.0)

    cov = η**2 * pm.gp.cov.Matern52(1, ℓ)
    gp = pm.gp.Marginal(cov_func=cov)

    f = gp.marginal_likelihood("f", X=X, y=y, testval=1.0)

    map_estimates = pm.find_MAP()
```


or by using the start parameter of the find_MAP function, as shown below:

```python
map_estimates = pm.find_MAP(start={'ℓ': 0.7, 'η': 10.0, 'f': 1.0})
```


Thanks, I really appreciate you replying.

It’s good to know it’s not automatically doing the latter optimization thing by default. But to be clear about what I was asking in the first part:

The marginal likelihood of the GP model is (1), where f are the latent values of the GP which we integrate out, y is the dependent variable we’re modelling, X are the predictor values, and \theta are the kernel hyper-parameters (\ell and \eta in your example):

p(y \mid X, \theta) = \int p(y \mid f, X, \theta)\, p(f \mid X, \theta)\, df \ \ \ \ (1)
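As a side note, since the integrand in (1) is Gaussian, the log of (1) has a standard closed form (Rasmussen and Williams, eq. 2.30), where K_y denotes the kernel matrix plus the observation noise variance on the diagonal (my notation here, not from the example above):

\log p(y \mid X, \theta) = -\frac{1}{2} y^{\top} K_y^{-1} y - \frac{1}{2} \log |K_y| - \frac{n}{2} \log 2\pi

This is the term the optimizer evaluates alongside the log priors.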

This is a multivariate normal. This means the posterior for the hyper-parameters \theta is:

p(\theta \mid y, X) \propto p(y \mid X, \theta)\, p(\theta) \ \ \ \ (2)

So my question is: does find_MAP maximize (2) (or minimize its negative log)?

By comparing what you wrote with what I’ve written I think yes.

Oh, now I follow. You are right! find_MAP does consider the conditional p(y \mid \theta, X) too, and thus optimizes the values of the hyper-parameters \theta.
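Putting it in symbols, what find_MAP returns here is just a restatement of (2), with the evidence term dropped since it does not depend on \theta:

\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \left[ \log p(y \mid X, \theta) + \log p(\theta) \right]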

Great thanks.

I think my confusion came from many different sources saying the parameters come from maximizing the marginal likelihood, which I assumed meant without priors (that would be type II maximum likelihood, according to Rasmussen and Williams). But I guess I didn’t read closely enough, and including priors to maximize the posterior was assumed all along.

Note that find_MAP maximizes the posterior distribution, but not the posterior marginal distribution; the two are different. There is a bit more discussion on this in this thread: Gaussian Process Regression level 1 inference: Re-producing Mauna Loa CO2 Example with PyMC3


Thanks for this.