I see that PyMC3 has a high-level Gaussian Process (GP) API. GPs are quite abstract, so I'd like to get more familiar with them before letting the API do all the thinking for me.
I stumbled on this blog post, which implements GPs in Python from scratch and has been very helpful. The author optimizes the log marginal likelihood to infer the kernel hyperparameters. Because he optimizes a point estimate, this seems more aligned with a frequentist approach to GPs than a Bayesian one.
Broadly speaking, how does PyMC3 use HMC for GPs? Is it sampling the GP hyperparameters (such as L and sigma in the RBF kernel)?
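(For concreteness, by L and sigma I mean the lengthscale and amplitude in the usual RBF / exponentiated-quadratic kernel, something like:)

```latex
k(x, x') = \sigma^2 \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right)
```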
Bonus points: I would love to see a simple implementation of a GP using vanilla PyMC3 (not the GP API). If anyone has such a resource, it would be tremendously helpful!
Those lines are doing the same as the ones highlighted here:
It is not so easy to pick apart the lines you highlighted unless you already know what is going on with GPs. In any case, here is my attempt at doing it:
The first PyMC3 line that you highlighted creates the covariance-matrix-generating function, using the ExpQuad kernel with the inverse of rhosq as the (inverse) lengthscale, and finally scaling it by etasq. You could write your own method to replace ExpQuad, as it simply creates an N×N matrix where each entry depends on the squared distance between X_i and X_j, scaled using the inverse of rhosq. The helper method just aids in the construction of that matrix.
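As a rough sketch of what I mean (my own hypothetical helper, roughly the book's parametrization, not the book port's code), building that covariance matrix by hand could look like:

```python
import numpy as np

def expquad_cov(X, etasq, rhosq):
    """Exponentiated-quadratic covariance, roughly the book's parametrization:
    K[i, j] = etasq * exp(-rhosq * squared_distance(X[i], X[j]))."""
    d2 = (X[:, None] - X[None, :]) ** 2   # N x N matrix of pairwise squared distances
    return etasq * np.exp(-rhosq * d2)

# e.g. K = expquad_cov(np.linspace(0, 10, 5), etasq=1.0, rhosq=0.5)
```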
The second line tells PyMC3 that you are working with a latent GP, as opposed to a model where the likelihood itself is multivariate Gaussian. The distinction exists because different optimizations can be done depending on whether the GP is latent or is the likelihood (but this is already an API-level thing). This line creates a gp object that combines the covariance-matrix-generating function defined in the first line with a mean-generating function (the default is to set it to zero, which is why you don't see it explicitly passed in this line).
In the third line you simply pass the input data to your gp object, which uses it to generate samples from the multivariate Gaussian defined by the GP.
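Putting the three steps together, a minimal sketch of those lines (hypothetical priors and stand-in data, not the book port verbatim) might be:

```python
import numpy as np
import pymc3 as pm

X = np.linspace(0, 10, 20)[:, None]   # stand-in inputs, shape (N, 1)

with pm.Model() as model:
    etasq = pm.HalfCauchy("etasq", beta=1)
    rhosq = pm.HalfCauchy("rhosq", beta=1)

    # 1. covariance-matrix-generating function: ExpQuad with an inverse lengthscale,
    #    scaled by etasq
    cov = etasq * pm.gp.cov.ExpQuad(input_dim=1, ls_inv=rhosq)

    # 2. latent GP object; the mean function defaults to zero
    gp = pm.gp.Latent(cov_func=cov)

    # 3. evaluate the GP prior at the inputs -> a multivariate-normal draw over f(X)
    f = gp.prior("f", X=X)
```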
It helps me to think of GPs as a specific (yet very useful) way of parametrizing a multivariate Gaussian. The most challenging part is how to parametrize the covariance matrix, and the different kernels and scaling factors do exactly this. It may be helpful to compare it with other models using multivariate Gaussians that are not GPs (they appear earlier in that same chapter).
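To make the "parametrized multivariate Gaussian" point concrete, the latent GP prior above is conceptually just the following (PyMC3 additionally reparametrizes via a Cholesky factor and adds a little jitter for numerical stability):

```python
import numpy as np
import pymc3 as pm

X = np.linspace(0, 10, 20)[:, None]

with pm.Model():
    etasq = pm.HalfCauchy("etasq", beta=1)
    rhosq = pm.HalfCauchy("rhosq", beta=1)
    cov = etasq * pm.gp.cov.ExpQuad(1, ls_inv=rhosq)

    # roughly what gp.prior("f", X=X) amounts to: a zero-mean MvNormal whose
    # covariance matrix is built by evaluating the kernel at the inputs
    f = pm.MvNormal("f", mu=np.zeros(X.shape[0]), cov=cov(X), shape=X.shape[0])
```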
I don't know if you had the chance to skim through the blog post I linked, but I think I've got a pretty decent grasp of the mechanics of GPs from a ~frequentist perspective. The author arrives at the optimal kernel hyperparameters (L and sigma for the ExpQuad covariance function) by optimizing the log marginal likelihood.
From the SR code snippet and your explanation, I can see that the author chose to supply a Poisson likelihood and correspondingly selected the Latent GP. I have a few thoughts/questions to this end:
How is the Bayesian/PyMC3 approach different than the frequentist one? (It seems that rather than optimizing kernel hyperparameters, they’re sampled via HMC.)
How does the decision logic to supply your own likelihood (and use the Latent GP) work? (I haven’t come across this idea before.)
Big picture: I'm decently familiar with GPs; the Bayesian/PyMC3 connection seems to be where I'm most confused, as log marginal likelihood optimization has either been abstracted away or replaced with something else (say, sampling), and in some cases the likelihood isn't necessary at all (which is new for me).
I didn’t have the chance to read the blog you linked to.
The difference you are hinting at here is the difference between maximization (find the single most likely value) and integration (weight all the values by their probability over the entire range, which is indeed what HMC tries to approximate). Now, GPs are by nature a Bayesian construct, so I am not sure how frequentist the approach in the blog really is.
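A rough sketch of that contrast in PyMC3 terms (a toy Gaussian-likelihood example of my own, not the book model): the model is identical, only the last step changes.

```python
import numpy as np
import pymc3 as pm

# toy regression data
X = np.linspace(0, 10, 30)[:, None]
y = np.sin(X).ravel() + 0.2 * np.random.randn(30)

with pm.Model() as model:
    ell = pm.Gamma("ell", alpha=2, beta=1)       # lengthscale
    eta = pm.HalfNormal("eta", sigma=1)          # amplitude
    sigma = pm.HalfNormal("sigma", sigma=1)      # observation noise

    cov = eta ** 2 * pm.gp.cov.ExpQuad(1, ls=ell)
    gp = pm.gp.Marginal(cov_func=cov)            # Gaussian likelihood -> marginal GP
    y_obs = gp.marginal_likelihood("y_obs", X=X, y=y, noise=sigma)

    # maximization: a point estimate of the hyperparameters
    # (MAP here, the PyMC3 analogue of optimizing the log marginal likelihood)
    mp = pm.find_MAP()

    # integration: sample the hyperparameters with NUTS/HMC and average over them
    trace = pm.sample(1000, tune=1000)
```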
The logic is very simple: if your observed outcome data can be modeled with a multivariate Gaussian, you can (but might not want to) use a GP likelihood; otherwise you must necessarily be in the latent case.
Latent GPs are used to get latent parameters that will somehow end up influencing your likelihood, such as the intercepts or slopes in a hierarchical GLM.
The book example above uses a latent GP to get an island-specific intercept (which depends on distance data) in a Poisson GLM. Obviously a GP likelihood is out of the question, since the outcome data are counts and can only be non-negative.
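A stripped-down sketch of that structure (made-up data, priors, and names, not the book port verbatim): the latent GP supplies a per-island term that enters the Poisson rate through the log link.

```python
import numpy as np
import pymc3 as pm

# stand-ins for the islands data
coords = np.random.rand(10, 2) * 5        # hypothetical island coordinates
log_pop = np.random.rand(10) * 3
tools = np.random.poisson(20, size=10)    # observed tool counts

with pm.Model() as island_model:
    etasq = pm.HalfCauchy("etasq", beta=1)
    rhosq = pm.HalfCauchy("rhosq", beta=1)
    cov = etasq * pm.gp.cov.ExpQuad(2, ls_inv=rhosq)

    gp = pm.gp.Latent(cov_func=cov)
    g = gp.prior("g", X=coords)           # island-specific latent intercepts

    a = pm.Normal("a", mu=0, sigma=10)
    b = pm.Normal("b", mu=0, sigma=1)
    lam = pm.math.exp(a + g + b * log_pop)    # log link to the Poisson rate

    T = pm.Poisson("T", mu=lam, observed=tools)
```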
It should also be possible to have a latent GP and a likelihood GP in the same model, but I don't know how common this is.