I am not sure if I follow “marginal likelihood is first maximized, then the (log) prior is added and then the maximum of the posterior found”, but as far as I know, `find_MAP()` minimizes the negative joint log probability (the unnormalized log posterior) of your model.
To be clear, say you have a model like the one given below:
```python
import pymc3 as pm

with pm.Model() as model:
    ℓ = pm.Gamma("ℓ", alpha=2, beta=1)
    η = pm.HalfCauchy("η", beta=5)

    cov = η**2 * pm.gp.cov.Matern52(1, ℓ)
    gp = pm.gp.Marginal(cov_func=cov)
    # noise is a required argument of marginal_likelihood in PyMC3;
    # it could also be a random variable instead of a fixed value
    f = gp.marginal_likelihood("f", X=X, y=y, noise=0.1)

    map_estimates = pm.find_MAP()
```
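Here `X` and `y` are whatever training inputs and targets you already have. If you want the snippet to run end to end, some hypothetical toy data like this (defined before the model block) would do:

```python
import numpy as np

# toy 1-D regression data: X must be a column matrix of shape (n, 1)
# for the 1-dimensional Matern52 kernel, and y a 1-D array of targets
X = np.linspace(0, 10, 50)[:, None]
y = np.sin(X).ravel() + 0.3 * np.random.randn(50)
```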
This will build the joint log probability of your model and minimize its negative with respect to the free continuous variables (here ℓ and η). In this example the joint log probability, $\log P(\ell, \eta \mid y) = \log P(\ell) + \log P(\eta) + \log P(y \mid \ell, \eta) + \text{const}$, is the sum of the log densities of `pm.Gamma` (the prior on ℓ), `pm.HalfCauchy` (the prior on η) and `pm.MvNormal` (the GP’s marginal likelihood of `y`). Maximizing this log probability is, in general, a non-convex optimization problem, but methods like L-BFGS-B work out pretty well on small to medium sized data.
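To make the connection to the optimizer explicit, here is a small sketch reusing the model above. I'm relying on `find_MAP`'s `method` and `return_raw` arguments, which hand the problem to `scipy.optimize.minimize` and return its raw result alongside the estimates:

```python
with model:
    # method is passed on to scipy.optimize.minimize; "L-BFGS-B" is the default
    map_estimates, opt_result = pm.find_MAP(method="L-BFGS-B", return_raw=True)

# opt_result.fun is the minimized objective: the negative log posterior
# evaluated on the transformed (unconstrained) parameter space
print(opt_result.fun, opt_result.success)
print(map_estimates["ℓ"], map_estimates["η"])
```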
I’ve read, however, that for marginal likelihood models the prior just provides starting points for the marginal likelihood maximization, to get around its non-convex nature.
You can do this in PyMC3 either by providing a `testval` to the distribution or by passing the `start` parameter to `find_MAP()`. Here’s an example:
```python
with pm.Model() as model:
    ℓ = pm.Gamma("ℓ", alpha=2, beta=1, testval=0.7)
    η = pm.HalfCauchy("η", beta=5, testval=10.0)

    cov = η**2 * pm.gp.cov.Matern52(1, ℓ)
    gp = pm.gp.Marginal(cov_func=cov)
    # the marginal likelihood is tied to the observed y, so it takes no testval
    f = gp.marginal_likelihood("f", X=X, y=y, noise=0.1)

    map_estimates = pm.find_MAP()
```
or by using the `start` parameter of the `find_MAP()` function, as shown below:
```python
with model:
    map_estimates = pm.find_MAP(start={'ℓ': 0.7, 'η': 10.0})
```
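Note that only the free hyperparameters (here ℓ and η) take starting values; the GP’s marginal likelihood is fixed to the observed `y`. Also, if I remember the PyMC3 behaviour correctly, any variable missing from `start` falls back to its default test value, so a partial dictionary works too; a minimal sketch:

```python
with model:
    # only override ℓ; η should start from its default test value
    map_estimates = pm.find_MAP(start={'ℓ': 0.7})

print(map_estimates)  # the returned dict also includes the log-transformed copies by default
```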