Gaussian Process Regression level 1 inference: Reproducing the Mauna Loa CO2 Example with PyMC3

Hi everyone!

I would like to reproduce the Gaussian process regression example from Rasmussen & Williams (2006), Gaussian Processes for Machine Learning, p. 118 ff., with PyMC3. In this example, they build a model for the Mauna Loa CO2 dataset.

First they define four covariance functions $k_1(x,x'), \dots, k_4(x,x')$ with 11 unknown hyperparameters in total ($\theta_1, \theta_2, \dots, \theta_{11}$).

By subtracting the empirical mean and optimizing the marginal likelihood (which marginalizes over the parameters, NOT the hyperparameters: level 1 inference, MAP estimate) they estimate the values of the hyperparameters (see page 109 for a general description of level 1, 2 and 3 inference, and page 112 for why they use level 1 inference for Gaussian process regression).

I tried that in PyMC3 with the pymc3.find_MAP() function but failed to reproduce the values of the estimated hyperparameters. It is unclear to me how I can avoid assigning probability distributions to the hyperparameters $(\theta_1, \theta_2, …, \theta_{11})$ and instead declare them as some kind of unknown quantities.

I managed to reproduce the example in scikit-learn and thought it would be possible in PyMC3 as well. At the moment, however, it seems to me that PyMC3 wants me to assign prior distributions to the hyperparameters $(\theta_1, \theta_2, …, \theta_{11})$ and, instead of conducting level 1 inference, conduct level 2 inference.

Does anyone have an idea how I can avoid this?

Below is my code (without prior distributions on the hyperparameters, so running it results in an error):

    import pymc3 as pm
    import theano.tensor as tt
    from sklearn.datasets import fetch_mldata

    data = fetch_mldata('mauna-loa-atmospheric-co2').data
    X = data[:, [1]]   # inputs (time)
    y = data[:, 0]     # CO2 concentration

    with pm.Model() as model:

        # Define the unknown hyperparameters theta_1 ... theta_11 here.
        # I did not do this, since I do not want them to follow a specific
        # distribution; as a result, theta_1 ... theta_11 and the warp
        # function `mapping` are undefined and the code raises an error.

        # Define kernels
        k1 = theta_1 * pm.gp.cov.ExpQuad(1, theta_5)   # long-term smooth rising trend
        k2 = theta_2 * pm.gp.cov.ExpQuad(1, theta_6) * \
            pm.gp.cov.WarpedInput(1, cov_func=pm.gp.cov.ExpQuad(1, theta_11),
                                  warp_func=mapping)   # seasonal component
        k3 = theta_3 * pm.gp.cov.RatQuad(1, theta_7, theta_8)   # medium-term irregularities
        k4 = theta_4 * pm.gp.cov.ExpQuad(1, theta_9) + theta_10 * tt.eye(len(X))   # noise terms

        # Build the final covariance function
        f_cov = k1 + k2 + k3 + k4
        y_obs = pm.gp.GP('y_obs', cov_func=f_cov, sigma=1, observed={'X': X, 'Y': y})

    map_estimate = pm.find_MAP(model=model)

I think in this context (MLE via level 1 inference, that is), the priors on the $\theta_i$ are (implicitly) just flat priors, so you can assign them using either pm.Flat or a Gaussian with a very large sigma.
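
A minimal sketch of what that could look like (only theta_1 is shown; the same pattern would apply to the other ten hyperparameters, and pm.HalfFlat is a similar option for hyperparameters that must stay positive):

    import pymc3 as pm

    with pm.Model() as model:
        # an (effectively) flat prior, so find_MAP reduces to maximum likelihood
        theta_1 = pm.Flat('theta_1')
        # or a very weakly informative alternative:
        # theta_1 = pm.Normal('theta_1', mu=0.0, sd=100.0)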

Thanks for your answer! I have to admit I am new to both Bayesian statistics and PyMC3, so I will try to explain in detail.

I tried both pymc3.Flat and pymc3.Normal('theta_i', mu=0, sd=10) on the $\theta_i$. pymc3.Flat raised an error, and pymc3.Normal() unfortunately did not give the expected results (still very far from the results in the original example). I suspect the reason is the following:

In the original example they conduct level 1 inference which they describe as “the posterior over parameters” on page 109:

$$
p(\boldsymbol{w}|\boldsymbol{y},X,\boldsymbol{\theta},\mathcal{H_i}) = \frac{p(\boldsymbol{y}|X,\boldsymbol{w},\mathcal{H_i})p(\boldsymbol{w}|\boldsymbol{\theta}, \mathcal{H_i})}{p(\boldsymbol{y}|X,\boldsymbol{\theta},\mathcal{H_i})}
$$

where $\boldsymbol{w}$ are the parameters, $\boldsymbol{\theta}$ are the hyperparameters and $\mathcal{H_i}$ are the different models. In this general form, they state that the marginal likelihood is given by:

$$
p(\boldsymbol{y}|X,\boldsymbol{\theta},\mathcal{H_i}) = \int p(\boldsymbol{y}|X,\boldsymbol{w},\mathcal{H_i})p(\boldsymbol{w}|\boldsymbol{\theta}, \mathcal{H_i}) d\boldsymbol{w}.
$$

Optimizing this marginal likelihood w.r.t. the hyperparameters should yield the MAP estimate for the hyperparameters. Now, do I have to assign a probability distribution to the hyperparameters when conducting level 1 inference? For a given Gaussian process with a given covariance function $K_f$, the marginal likelihood is (page 113):

$$
\log p(\boldsymbol{y}|X, \boldsymbol{\theta}) = -\frac{1}{2}\boldsymbol{y}^\top K_y^{-1} \boldsymbol{y}-\frac{1}{2}\log|K_y|-\frac{n}{2}\log 2\pi
$$

$$
K_y = K_f +\sigma_n^2 I
$$

I read this as "the log probability of the data $\boldsymbol{y}$ given the inputs $X$ and the hyperparameters $\boldsymbol{\theta}$". If I read this right, assigning any probability distributions to the $\boldsymbol{\theta_i}$ and calculating the MAP estimate would be wrong, since here the $\boldsymbol{\theta_i}$ are not treated as random variables but as deterministic variables.

Optimizing the marginal likelihood under these two treatments then gives different results.

Now I wonder whether:

  1. I got the theory right
  2. PyMC3 can support me in calculating the level 1 inference MAP estimate for the hyperparameters of a Gaussian process in the above-mentioned fashion

Many thanks in advance and a nice weekend to all of you!

On 1. Yes I think you have the theory right.

Now, do I have to assign a probability distribution on the hyperparameters when conducting level 1 inference?

Like @junpenglao said, not assigning a prior is equivalent to using a flat prior, or $p(\theta) \propto 1$. If you do this, you're finding the MAP point of the posterior $p(\theta \mid y, X) \propto p(y \mid X, \theta)\, p(\theta) \propto p(y \mid X, \theta)$, which is equivalent to maximum likelihood on $p(y \mid X, \theta)$. So the verbiage may be different, but the actual math is the same.
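
Spelled out, with a flat prior the log-posterior and the log-likelihood differ only by an additive constant,

$$
\log p(\theta \mid \boldsymbol{y}, X) = \log p(\boldsymbol{y} \mid X, \theta) + \log p(\theta) + \text{const} = \log p(\boldsymbol{y} \mid X, \theta) + \text{const},
$$

so the maximizing $\theta$ is the same in both cases.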

If I read this right, assigning any probability distributions to the $\boldsymbol{\theta_i}$ and calculating the MAP estimate would be wrong, since here the $\boldsymbol{\theta_i}$ are not treated as random variables but as deterministic variables.

They are treated as deterministic after the MAP is found!

One thing that R+W do that is very important is to run the optimizer many times with different start points, then take the result with the highest log-probability. This signals that there may be local optima in the problem, or that numerical instability sometimes messes up the optimization. The optimization problem here is difficult because the hyperparameters $\theta$ are on widely different scales. Unfortunately, find_MAP doesn't have a random restart routine implemented.
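
A rough do-it-yourself restart loop could look like the sketch below (it assumes `model` is the pm.Model built above; the jitter scale and the number of restarts are arbitrary choices):

    import numpy as np
    import pymc3 as pm

    best_point, best_logp = None, -np.inf
    for seed in range(10):
        np.random.seed(seed)
        # jitter the default starting values to get a different start point
        start = {name: val + 0.5 * np.random.randn(*np.shape(val))
                 for name, val in model.test_point.items()}
        point = pm.find_MAP(start=start, model=model)
        # keep the restart with the highest log-probability
        logp = model.logp(point)
        if logp > best_logp:
            best_point, best_logp = point, logp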

A better option is to run either ADVI or NUTS and then compute the MAP from the estimate of the full posterior distribution. But the best option is to use the full posterior! The Stan folks have written recommendations on setting priors, and there's a section on GPs.
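
A sketch of that route (assuming the hyperparameters have been given proper priors, which NUTS and ADVI need):

    import pymc3 as pm

    with model:
        trace = pm.sample(1000, tune=1000)         # NUTS for the continuous hyperparameters
        # or a cheaper variational approximation:
        # approx = pm.fit(n=20000, method='advi')
        # trace = approx.sample(1000)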

I hope that helps. Of course follow up if that doesn’t make sense or you’re having issues. I also want to mention that we’re doing a lot of work on GPs in PyMC3, so keep an eye out for updates.

Thanks for your answer! This is not easy to get my head around, and I am still not quite sure whether we are discussing the same level of inference.

Let me try to summarize. In general level 1 inference is the posterior over the parameters $\boldsymbol{w}$:

$$
p(\boldsymbol{w}|\boldsymbol{y},X,\boldsymbol{\theta},\mathcal{H_i}) = \frac{p(\boldsymbol{y}|X,\boldsymbol{w},\mathcal{H_i})p(\boldsymbol{w}|\boldsymbol{\theta}, \mathcal{H_i})}{p(\boldsymbol{y}|X,\boldsymbol{\theta},\mathcal{H_i})}
$$
with marginal likelihood:
$$
p(\boldsymbol{y}|X,\boldsymbol{\theta},\mathcal{H_i}) = \int p(\boldsymbol{y}|X,\boldsymbol{w},\mathcal{H_i})p(\boldsymbol{w}|\boldsymbol{\theta}, \mathcal{H_i}) d\boldsymbol{w}.
$$

On the other hand level 2 inference over the hyperparameters $\boldsymbol{\theta}$ is:

$$
p(\boldsymbol{\theta}|\boldsymbol{y},X,\mathcal{H_i}) = \frac{p(\boldsymbol{y}|X,\boldsymbol{\theta},\mathcal{H_i})p(\boldsymbol{\theta}| \mathcal{H_i})}{p(\boldsymbol{y}|X,\mathcal{H_i})}
$$
with marginal likelihood:
$$
p(\boldsymbol{y}|X,\mathcal{H_i}) = \int p(\boldsymbol{y}|X,\boldsymbol{\theta},\mathcal{H_i})p(\boldsymbol{\theta}| \mathcal{H_i}) d\boldsymbol{\theta}.
$$

First I want to emphasize that Rasmussen and Williams state that optimizing the level 1 marginal likelihood w.r.t. the hyperparameters yields an approximation to the MAP estimate of the hyperparameters $\boldsymbol{\theta}$ (page 109).
Now, looking at the level 1 marginal likelihood, I see two terms: the likelihood of the data $p(\boldsymbol{y}|X,\boldsymbol{w},\mathcal{H_i})$ and the prior probability of the parameters $p(\boldsymbol{w}|\boldsymbol{\theta}, \mathcal{H_i})$.

What does the probability over the parameters actually mean in the case of a Gaussian process regression, a non-parametric model? What are the parameters, and how can we assign a prior distribution to them? The authors state that “[…] it may not be immediately obvious what the parameters of the model are. Generally, one can regard the noise-free latent function values at the training inputs $\boldsymbol{f}$ as the parameters.” (page 112) And as far as I understand, we assume (by applying a Gaussian process regression model in the first place) that those values follow a Gaussian distribution. So assigning another flat prior distribution to the $\boldsymbol{w}$ would be somehow redundant. But do we then have to assign a prior to the hyperparameters $\boldsymbol{\theta}$? Looking at the level 1 marginal likelihood, I cannot see a distribution over the hyperparameters $\boldsymbol{\theta}$; I can see a distribution over the parameters, but not over the hyperparameters. I could be wrong here.

However, let’s assume for now that it is not necessary to put a prior distribution on the hyperparameters $\boldsymbol{\theta}$ and see where that leads us. Then, for a Gaussian process regression model, the level 1 marginal likelihood is:

$$
\log p(\boldsymbol{y}|X, \boldsymbol{\theta}) = -\frac{1}{2}\boldsymbol{y}^\top K_y^{-1} \boldsymbol{y}-\frac{1}{2}\log|K_y|-\frac{n}{2}\log 2\pi
$$

$$
K_y = K_f +\sigma_n^2 I
$$

We can optimize this w.r.t. the hyperparameters $\boldsymbol{\theta}$ by taking the derivatives, setting them to zero and solving the equations. The derivatives are given by (page 104):

$$
\frac{\partial}{\partial \theta_i} \log p(\boldsymbol{y}|X,\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{y}^\top K^{-1}\frac{\partial K}{\partial \theta_i}K^{-1}\boldsymbol{y}-\frac{1}{2}\operatorname{tr}\left(K^{-1}\frac{\partial K}{\partial \theta_i}\right)
$$

As a result we get the set of hyperparameters $\boldsymbol{\theta}$ that maximizes the marginal likelihood “on level 1”.
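
To make this concrete, here is a minimal numpy/scipy sketch of ML-II for a toy one-component kernel (amplitude, lengthscale and noise only, not the full four-component R+W covariance; the log-parameterization is just one convenient way to keep everything positive, and y is assumed to be mean-subtracted):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_marginal_likelihood(log_theta, X, y):
        amp, ls, noise = np.exp(log_theta)
        sq_dists = (X - X.T) ** 2                  # X is assumed to be n x 1
        K_y = amp**2 * np.exp(-0.5 * sq_dists / ls**2) + noise**2 * np.eye(len(y))
        L = np.linalg.cholesky(K_y)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        # negative of the log marginal likelihood given above
        return (0.5 * y @ alpha
                + np.sum(np.log(np.diag(L)))
                + 0.5 * len(y) * np.log(2 * np.pi))

    # res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y))
    # amp_hat, ls_hat, noise_hat = np.exp(res.x)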

From my understanding we did not have to put a prior distribution on the hyperparameters $\boldsymbol{\theta}$ to get this solution. Actually, assigning a probability distribution to $\boldsymbol{\theta}$ automatically brings us to level 2 inference, as I understand it. And to get back to my original question: it seems to me that PyMC3 does not support this special case of inference on the hyperparameters of a Gaussian process regression via MAP estimate. Is this correct?

Hi @max, I had a look at the page you are referring to in R+W, and I think I more or less see where your confusion is. I will try to clarify below; let me know if I misunderstand you in any way:

This is not correct: the MAP estimation is over the parameters $\boldsymbol{w}$. As in the example on page 108: “For example, the parameters (w) could be the parameters in a linear model […]. At the second level are hyperparameters θ which control the distribution of the parameters at the bottom level. For example […] the “ridge” term in ridge regression are hyperparameters.” Taking a simple regression example, Y = X*b + eps: b would be the parameters in this case, and the parameters of the prior on b would be the hyperparameters θ. If you assume a flat prior, you have MAP estimation = MLE = the least-squares solution.
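
To spell out the ridge connection (the standard correspondence, not something specific to R+W): with noise variance $\sigma^2$ and a prior $b \sim \mathcal{N}(0, \tau^2 I)$, the negative log-posterior is, up to constants,

$$
-\log p(b \mid \boldsymbol{y}, X) \;\propto\; \|\boldsymbol{y} - Xb\|^2 + \frac{\sigma^2}{\tau^2}\|b\|^2,
$$

which is exactly ridge regression with penalty $\lambda = \sigma^2/\tau^2$; letting $\tau^2 \to \infty$ (a flat prior) gives $\lambda = 0$, i.e. ordinary least squares.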

So now, getting to GPs: $\boldsymbol{w}$ here is non-parametric, a function. As explained later in the book (which you also quote), “one can regard the noise-free latent function values at the training inputs $\boldsymbol{f}$ as the parameters”. So, if you want a prior for a value, you need a distribution over values (something that takes the value as input and outputs a probability); similarly, if you want a prior for a function, you need a distribution over functions. It is hard to wrap your head around at first; you can instead think of it as a distribution that takes a lot of values at once (infinitely many, in fact) and outputs a probability. And when you sample from a distribution over functions, you get functions, which is effectively what you are seeing in the docs: samples from the GP prior.
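
As a small illustration of “sampling functions” (a sketch; the kernel and lengthscale values are arbitrary):

    import numpy as np
    import pymc3 as pm

    X = np.linspace(0, 10, 100)[:, None]
    K = pm.gp.cov.ExpQuad(1, 1.0)(X).eval()        # covariance matrix K(X, X)
    # each row of `draws` is one draw from the GP prior, i.e. one sampled function
    draws = np.random.multivariate_normal(
        mean=np.zeros(len(X)), cov=K + 1e-6 * np.eye(len(X)), size=3)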

Of course, writing a prior for a function is not straightforward (after all, there are potentially infinitely many parameters, or, as in R+W, “The more training cases there are, the more parameters.”), so instead we put a prior on the hyperparameters (the parameters of the covariance function) $\boldsymbol{\theta}$, the parameters that generate the functions.

I hope this point is clearer now: inference on a GP is automatically level 2, as we cannot put a prior on $\boldsymbol{w}$ directly but need to put a prior on the hyperparameters $\boldsymbol{\theta}$. This also brings us back to the previous answer: even when we just find the MAP or MLE of the covariance function parameters (i.e., $\boldsymbol{\theta}$), there is a prior there (i.e., a flat prior).

You can do inference via MAP estimation in PyMC3 (using find_MAP); the problem is just that in higher dimensions the result will not accurately reflect the posterior.

Hi @junpenglao, hi @bwengals, first of all thanks for not giving up on me - I really appreciate it, since I am learning a lot in this discussion.
I do not want to overstrain your patience and hope you also benefit from discussing with me. Allow me to comment on a few things.

Yes of course, you are absolutely right - I have to express myself more precisely.

I am still not entirely confident about the statement that we are automatically doing level 2 inference in the case of a Gaussian process regression model, because R+W state on the same page 109: “In practice, especially the evaluation of the integral in eq. (5.6) [level 2 marginal likelihood] may be difficult, and as an approximation one may shy away from using the hyperparameter posterior in eq. (5.5) [level 2 inference], and instead maximize the marginal likelihood in eq. (5.4) [level 1] w.r.t. the hyperparameters, $\boldsymbol{\theta}$. This approximation is known as type II maximum likelihood (ML-II).”
Later in chapter 5.4.1 on page 112 they “[…] proceed by applying eq. (5.3) [level 1 inference] and eq. (5.4) for the 1st level inference […]”.
What I want to show with this is that R+W give me the impression that level 1 inference is indeed possible for a Gaussian process regression model. (It might not be optimal, though, and more of a first guess for the values of the hyperparameters, which one can then take into account in level 2 inference.)

As for the objection that we cannot assign a prior to the weights $\boldsymbol{w}$:
Aren’t we implicitly putting a prior on $\boldsymbol{w}$ by using a Gaussian process regression model in the first place? Aren’t we actually assuming that the noise-free latent function values at the training inputs arise from a realization of a Gaussian process and have a joint Gaussian distribution with a certain mean and a certain covariance structure (imposed by the mean and covariance functions)?

No problem, I am glad you find the discussion helpful!

In R+W, page 109 is more a brief discussion of general Bayesian model selection, whereas page 112 is the application on GP.

You can, of course, perform level 1 inference. In effect, it is just writing down the covariance explicitly conditioned on the input X and estimating the covariance matrix directly. For example, in PyMC3, after you create your covariance function you can plug in the input X and get a covariance matrix.
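
Something along these lines, for example (a sketch with made-up hyperparameter values):

    import numpy as np
    import pymc3 as pm

    X = np.linspace(0, 10, 50)[:, None]
    # a sum of two kernels with fixed amplitudes and lengthscales
    cov_func = 2.0 * pm.gp.cov.ExpQuad(1, 10.0) + 0.5 * pm.gp.cov.ExpQuad(1, 1.0)
    K = cov_func(X).eval()                         # the 50 x 50 covariance matrix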

But I think jumping directly to level 2 inference (using R+W terminology) might be easier to understand, because even in R+W, when they refer to level 1 inference (i.e., getting an estimate of the noise-free latent $\boldsymbol{f}$), this noise-free latent $\boldsymbol{f}$ is conditioned “on the hyperparameters (the parameters of the covariance function) θ” (see eq. 5.8 and the text below it).

Your understanding is correct. What I was trying to express above is that, since this prior is a distribution over functions, it is unavoidable that you are doing level 2 inference (i.e., doing inference on the parameters of that distribution).

Thanks a lot @junpenglao!
I just found this: Looking at the Keeling Curve with GPs in PyMC3 - posts. It is basically what I was trying to achieve with PyMC3, with the difference that the author uses informative priors on the hyperparameters and solves via sampling. This is a great help.

As R+W state on page 109: “In practice, especially the evaluation of the integral in eq. (5.6) [level 2 marginal likelihood] may be difficult, and as an approximation one may shy away from using the hyperparameter posterior in eq. (5.5) [level 2 inference], and instead maximize the marginal likelihood in eq. (5.4) [level 1] w.r.t. the hyperparameters, $\boldsymbol{\theta}$. This approximation is known as type II maximum likelihood (ML-II).”

My main takeaway from this discussion is that there are (amongst others) two ways of determining the values of the hyperparameters $\boldsymbol{\theta}$:

  1. MAP estimation on level 2 inference. For this, one has to assign a prior to the hyperparameters.
  2. Approximating this MAP estimate by optimizing the marginal likelihood of level 1 inference w.r.t. the hyperparameters (ML-II).

PyMC3’s function pymc3.find_MAP() only supports the first way and not the second (ML-II).

Oh good, I hope that post helps! The post uses find_MAP though, not sampling. But be careful: the code in that post is in a branch that is not part of pymc3 master yet and is unstable.

Just an opinion here, but I’m not sure that drawing a strong distinction conceptually between parameters and hyperparameters is necessary or terribly helpful. From a Bayesian viewpoint, all unknown parameters are just unknown parameters. The ‘hyper’ in hyperparameter usually implies a hierarchical relationship, where a prior distribution has a dependence on another unknown parameter.

Say there is some model with an unknown parameter $x$. You assign a normal prior with mean zero and an unknown variance $\sigma^2$. Since the variance is unknown, you have to assign it a prior as well. $x$ is called a parameter, and $\sigma^2$ is sometimes referred to as a hyperparameter. They could both just as well be called ‘parameters’, though. Or, like @junpenglao said, the strength of the L2 penalty in ridge regression is an example (which is equivalent to a normal prior on the unknown coefficients). Or the weights in a neural network are parameters with a dependence on the number of layers used, which is called a hyperparameter (or just call it another unknown parameter).
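
A toy PyMC3 sketch of that first example (the names and values are made up for illustration):

    import pymc3 as pm

    with pm.Model() as toy_model:
        sigma = pm.HalfCauchy('sigma', beta=5.0)   # unknown prior scale: the "hyperparameter"
        x = pm.Normal('x', mu=0.0, sd=sigma)       # the "parameter", whose prior depends on sigma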

GPs are used as a prior over a function, not just a number or a vector. Doing this is only useful when the function itself is an unknown parameter in a model. The characteristics of the prior on the function (periodic? smooth? lots of variation?) are determined by a mean function and a covariance function. The mean function and covariance usually have unknown parameters, like a lengthscale. Parameters of the mean and covariance function like lengthscales are often called hyperparameters in books or papers talking about GPs.

#1 is certainly an option. Using MCMC or variational inference is preferable to MAP, if you can afford the extra computation. But no on #2: in the equation for level 1 inference (looking at eq. 5.3), $\boldsymbol{\theta}$ is given. Your other option is to do inference on $p(\boldsymbol{w}, \boldsymbol{\theta} \mid \boldsymbol{y}, X, \mathcal{H})$. Your choice is to either integrate $\boldsymbol{w}$ out analytically or not.

Thanks for your comment @bwengals, I appreciate it!

I don’t really get why #2 is not an option, for two reasons. The first one is that I trust R+W, who state that it is an option (with the danger of overfitting and of finding local optima, though); see the last third of page 109.
The second reason comes from looking at what ML-II actually does. It maximizes

$$
p(\boldsymbol{y}|X,\boldsymbol{\theta},\mathcal{H_i}) = \int p(\boldsymbol{y}|X,\boldsymbol{w},\mathcal{H_i})p(\boldsymbol{w}|\boldsymbol{\theta}, \mathcal{H_i}) d\boldsymbol{w}
$$

w.r.t. the hyperparameters $\boldsymbol{\theta}$. Let me try to put that in my own words: “maximize the probability of seeing the data, given the model $\mathcal{H_i}$ (a GP regression model with a certain mean and covariance structure), the inputs $X$ and the hyperparameters $\boldsymbol{\theta}$”. Given a training data set, we can find the hyperparameters $\boldsymbol{\theta}$ that do exactly that: maximize the probability of seeing that training data. I am with you that this is not the “finest Bayesian fashion”, since one does not take the uncertainty regarding the hyperparameters $\boldsymbol{\theta}$ into account. However, the approach is not totally “wrong”, right? Imagine I had no clue at all about the values of my hyperparameters. Then I would be happy to use this approach to get a first idea of where those hyperparameters could be. I could then take this result to level 2 inference and get probabilistic about my hyperparameters.