Vanilla implementation of Gaussian Process in PyMC3?

I see that PyMC3 has a high level Gaussian Process (GP) API. GPs are quite abstract and so I’d like to get more familiarity with them before letting the API do all the thinking for me.

I stumbled on this blog, which implements GPs in Python from scratch and has been very helpful. The author optimizes the log marginal likelihood for kernel hyperparameter inference. Because he optimizes a point estimate, this seems more aligned with a frequentist approach to GPs than a Bayesian one.
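As I understand it, that approach amounts to something like the following toy sketch (my own hand-rolled RBF kernel with a lengthscale and a noise scale as the only free hyperparameters, and made-up data, not the blog's actual code):

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    # work on the log scale so both hyperparameters stay positive
    lengthscale, sigma = np.exp(log_params)
    sq_dist = (X[:, None] - X[None, :]) ** 2
    K = np.exp(-0.5 * sq_dist / lengthscale**2) + (sigma**2 + 1e-6) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log p(y) = 0.5 y^T K^{-1} y + 0.5 log|K| + (n/2) log(2 pi)
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 25)
y = np.sin(X) + 0.1 * rng.standard_normal(25)

result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 0.1]), args=(X, y))
print(np.exp(result.x))  # point estimates of (lengthscale, sigma); no posterior, no uncertainty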

Broadly speaking, how does PyMC3 use HMC for GPs? Is it sampling the GP hyperparameters (such as the lengthscale L and sigma in the RBF kernel)?

Bonus points: I would love to see a simple implementation of a GP using vanilla PyMC3 (not the GP API). If anyone has such a resource, it would be tremendously helpful!


You should be able to copy the GP examples from the book Statistical Rethinking (Chp 14, 2nd edition) without having to use the GP API. For completeness, you can then check how to use the API for the same examples here: https://github.com/pymc-devs/resources/blob/master/Rethinking_2/Chp_14.ipynb


Hey, this is pretty close. The author does make use of the gp API, but perhaps someone could comment on what these methods are doing under the hood?

Those lines are doing the same as the ones highlighted here:

[screenshot of the corresponding highlighted lines from the Rethinking notebook]

It is not so easy to pick apart the lines you highlighted unless you already know what is going on with GPs. In any case, here is my attempt:

  1. The first PyMC line that you highlighted creates the covariance-matrix-generating function: the ExpQuad kernel with a lengthscale given by the inverse of rhosq, finally scaled by etasq. You could write your own method to replace ExpQuad; it simply builds an N x N matrix (one row and column per input point) whose (i, j) entry is the exponential of the negative squared distance between X_i and X_j, scaled by the lengthscale. The helper just aids in constructing that matrix (see the sketch after this list).

  2. The second line tells PyMC that you are working with a latent GP, as opposed to a model where the likelihood itself is multivariate Gaussian. The distinction exists because different optimizations can be applied depending on whether the GP is latent or is the likelihood itself (but that is an API-level concern). This line creates a gp object that combines the covariance-matrix-generating function defined in 1. with a mean-generating function (the default is zero, which is why you don't see it explicitly passed in this line).

  3. In the third line you pass the input data to your gp object, which uses it to place the multivariate Gaussian prior defined by the mean and covariance functions over the latent function values at those inputs.
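If you want to see this without the API, here is a minimal sketch in plain PyMC3 of what those three lines amount to (toy one-dimensional inputs; the etasq/rhosq parametrization follows the Rethinking example and differs slightly in scaling from pm.gp.cov.ExpQuad, which uses exp(-r^2 / (2 * ls^2))):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

X = np.linspace(0, 10, 25)[:, None]   # toy inputs
# squared Euclidean distances between all pairs of rows of X
sq_dist = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T

with pm.Model() as latent_gp:
    etasq = pm.Exponential("etasq", 1.0)   # overall (marginal) variance
    rhosq = pm.Exponential("rhosq", 1.0)   # controls how fast correlation decays with distance

    # step 1: the covariance matrix, built by hand instead of with pm.gp.cov.ExpQuad
    K = etasq * tt.exp(-rhosq * sq_dist) + 1e-6 * tt.eye(len(X))

    # steps 2 and 3: pm.gp.Latent(cov_func=...).prior("f", X=X) amounts to a
    # zero-mean multivariate normal over the function values at the inputs
    f = pm.MvNormal("f", mu=np.zeros(len(X)), cov=K, shape=len(X))

The API version additionally uses a non-centered, Cholesky-based parametrization under the hood, which usually samples better, but the model it defines is the same.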

It helps me to think of GPs as a specific (yet very useful) way of parametrizing a multivariate Gaussian. The most challenging part is how to parametrize the covariance matrix, and the different kernels and scaling factors do exactly this. It may be helpful to compare it with other models using multivariate Gaussians that are not GPs (they appear earlier in that same chapter).

The reference docs will also become more understandable with time (at least for me they did!): https://docs.pymc.io/Gaussian_Processes.html


Hi there, thanks! This was pretty helpful.

Don’t know if you had the chance to skim through the blog post I linked, but I think I’ve got a pretty decent grasp of the mechanics of GPs from a roughly frequentist perspective. The author arrives at the optimal kernel hyperparameters (L and sigma for the ExpQuad covariance function) by optimizing the log marginal likelihood.

From the SR code snippet and your explanation, I can see that the author chose to supply a Poisson likelihood and correspondingly selected the Latent GP. I have a few thoughts/questions to this end:

  • How is the Bayesian/PyMC3 approach different from the frequentist one? (It seems that rather than optimizing kernel hyperparameters, they’re sampled via HMC.)
  • How does the decision logic to supply your own likelihood (and use the Latent GP) work? (I haven’t come across this idea before.)

Big picture: I’m decently familiar with GPs; the Bayesian/PyMC3 connection is where I’m most confused, since log marginal likelihood optimization has either been abstracted away or replaced with something else (say, sampling), and in some cases the likelihood isn’t necessary at all (which is new to me).

I didn’t have the chance to read the blog you linked to.

The difference you are hinting at here is the difference between maximization (finding the single most likely value) and integration (averaging over all values, weighted by how probable each is over its entire range, which is what HMC approximates). Now, GPs are by nature a Bayesian construct, so I am not sure how frequentist the approach in the blog really is.
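In PyMC3 terms, the two options look roughly like this (a sketch with toy data and a hand-rolled exponentiated-quadratic kernel; the observed y enters through the marginalized MvNormal likelihood):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

X = np.linspace(0, 10, 30)[:, None]
y = np.sin(X).ravel() + 0.2 * np.random.randn(30)   # toy observations
sq_dist = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T

with pm.Model() as marginal_gp:
    etasq = pm.HalfNormal("etasq", sigma=1.0)
    rhosq = pm.HalfNormal("rhosq", sigma=1.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)

    # marginal likelihood: y ~ MvNormal(0, K + sigma^2 * I)
    K = etasq * tt.exp(-rhosq * sq_dist) + sigma**2 * tt.eye(len(X))
    pm.MvNormal("y_obs", mu=np.zeros(len(X)), cov=K, observed=y)

    map_estimate = pm.find_MAP()   # maximization: one value per hyperparameter
    trace = pm.sample()            # NUTS/HMC: posterior draws for etasq, rhosq, sigma

find_MAP with priors is MAP estimation (and with flat priors it reduces to the blog's marginal likelihood maximization), while pm.sample is the integration.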

The logic is very simple: if your observed outcome data can be modeled with a multivariate Gaussian, you can (but might not want to) use a GP likelihood; otherwise you must necessarily be in the latent case.

Latent GPs are used to get latent parameters which end up influencing your likelihood somehow, such as the intercepts or slopes in a hierarchical GLM.

The book example above uses a latent GP to get an island specific intercept (which depends on distance data) in a Poisson GLM. Obviously a GP likelihood is out of the question since the data can only be positive.

It should also be possible to have a latent GP and a GP likelihood in the same model, but I don’t know how common this is.
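At the API level, the two cases look something like this (a rough sketch with toy data and a fixed lengthscale, just to show the structure; the Rethinking model is a real instance of the latent case):

import numpy as np
import pymc3 as pm

X = np.linspace(0, 10, 20)[:, None]                 # toy inputs
y = np.sin(X).ravel() + 0.2 * np.random.randn(20)   # toy Gaussian outcome
counts = np.random.poisson(2.0, size=20)            # toy count outcome

cov = pm.gp.cov.ExpQuad(1, ls=1.0)

# Gaussian outcome: GP likelihood, with the latent f marginalized out analytically
with pm.Model() as gaussian_outcome:
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    gp = pm.gp.Marginal(cov_func=cov)
    gp.marginal_likelihood("y", X=X, y=y, noise=sigma)

# Non-Gaussian outcome (e.g. counts): latent GP feeding a GLM-style likelihood
with pm.Model() as poisson_outcome:
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior("f", X=X)
    pm.Poisson("counts", mu=pm.math.exp(f), observed=counts)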


Thanks, very helpful!

Hello, I’m reopening this topic because I have a question related to the PyMC implementation of Stats Rethinking here: https://github.com/pymc-devs/pymc-resources/blob/main/Rethinking_2/Chp_14.ipynb.

If I understand the API correctly, PyMC’s implementation of the GP model should be calculating the “distances” automatically based on X. Correct? If yes, I’m not sure why the distance matrix is being passed into gp.prior:

gp.prior("k", X=Dmat)

Moreover, since the island data is using a regular coordinate system, wouldn’t we want to set input_dim equal to 2? Simultaneously, we’d want to replace the distance matrix with a matrix with shape (N, 2) where each row represents the location of the island on the x,y-coordinate system.

The way it’s currently implemented, aren’t we generating the prior based on distances between distances, that is, the distances between the distances between the first island and all the other islands?

What am I missing here?

This is a style choice so that GPs are written in PyMC the way they are written in math. In a GP paper one doesn’t write something like y = f(k_1(x, x^\prime), k_2(x, x^\prime), ...); one instead writes y = f(x, x^\prime) directly, abstracting away the specific kernel functions (or functions thereof). @bwengals is more qualified to give a detailed explanation of what choices were made and why, as he wrote the gp module :slight_smile:

Yes, I believe this is true. Can you re-do the analysis the correct way and check? We’re working on adding an upgraded version of these notebooks in this PR (here is the updated GP lecture). If there is a big error, it would be good to fix it before we merge.

Hey Jesse, thanks for your response! I can try to re-do the analysis. (I’m a little new to the world of open-source—but encouraged by your interview with @AlexAndorra to give it a go :slight_smile: )

  • The PyMC GP API doesn’t take distance matrices directly. To reproduce the Rethinking example, I would recommend bypassing the GP API. If you check the source code for pm.gp.Latent, you’ll see it’s really not doing much. Feeding the distance matrix into pm.gp.Latent.prior as X is incorrect; it’s not supposed to be some distances-of-distances thing, the kernel is just a function of distance! The PyMC GP API is set up the way it is to make prediction easy.
  • The notation for a GP is usually f(x) \sim N(0, K(x, x')). So the GP can be evaluated at any x, but the covariance function K(x, x') is over all pairs of x.
  • When you have a Gaussian likelihood and your model is:
\begin{align} f_i &\sim \text{MvN}(0, k(x_i, x')) \\ y_i &\sim N(f_i, \sigma^2) \end{align}

You can marginalize out f so you don’t have to sample it, by computing \int p(y \mid f)\, p(f)\, df. The result is

\mathbf{y} \sim \text{MvN}(0, \mathbf{K} + \sigma^2 \mathbf{I})

So these two models are equivalent (you should just say you have a GP with a normal likelihood either way), but the latter is a trick you can apply so you don’t need to sample the GP itself. If you do this, you can also use optimization to find good values for the covariance function hyperparameters. If you put priors on them, you’d call it MAP estimation. If you don’t, you’d call it type 2 maximum likelihood, or “maximizing the marginal likelihood”.
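A quick numerical check of that equivalence, with a toy covariance matrix in plain NumPy (nothing PyMC-specific): simulate f ~ MvN(0, K), add iid Normal(0, sigma^2) noise, and compare the covariance of y to K + sigma^2 I.

import numpy as np

rng = np.random.default_rng(1)

# toy exponentiated-quadratic covariance over a handful of input points
x = np.linspace(0, 5, 8)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
sigma = 0.3

# hierarchical form: draw f ~ MvN(0, K), then y_i ~ N(f_i, sigma^2)
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))
f = L @ rng.standard_normal((len(x), 200_000))
y = f + sigma * rng.standard_normal(f.shape)

# the empirical covariance of y approaches the marginalized form K + sigma^2 * I
print(np.abs(np.cov(y) - (K + sigma**2 * np.eye(len(x)))).max())  # small (Monte Carlo error only)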


Could you glance at the code in the new rethinking PR I linked and confirm it’s doing the wrong thing?

Hey Jesse, I reran the code by manually defining the mean function and the covariance function, but for some reason I kept running into this error:

ValueError: The input matrix must be symmetric positive semidefinite.
Apply node that caused the error: multivariate_normal_rv{"(n),(n,n)->(n)"}(RNG(<Generator(PCG64) at 0x31AB40740>), [], [0. 0. 0. ... 0. 0. 0.], Composite{((i2 * exp((-1.0 * i0 * i1))) + i3)}.0)
Toposort index: 5
Inputs types: [RandomGeneratorType, TensorType(int64, shape=(0,)), TensorType(float64, shape=(10,)), TensorType(float64, shape=(10, 10))]
Inputs shapes: ['No shapes', (0,), (10,), (10, 10)]
Inputs strides: ['No strides', (0,), (8,), (80, 8)]
Inputs values: [Generator(PCG64) at 0x31AB40740, array([], dtype=int64), 'not shown', 'not shown']
Outputs clients: [[output[7](multivariate_normal_rv{"(n),(n,n)->(n)"}.0)], [output[1](alpha), Composite{exp((i0 + i1))}(ExpandDims{axis=0}.0, alpha)]]

The only way to sample without running into errors was to increase the amount of jitter in the covariance matrix:

cov = (eta_squared * pm.math.exp( -(rho_squared * ISLAND_DISTANCES ** 2) ) + (pt.eye(ISLAND_COUNT) * 1e-2)) 

Here’s my final model definition (for the section that incorporates population size):

import pymc as pm
import pytensor.tensor as pt

# KLINE, CULTURE_ID, TOOLS, ISLAND_DISTANCES, ISLAND_COUNT and coords are defined
# in earlier cells of the notebook (the Kline island data from Statistical Rethinking)
with pm.Model(coords=coords) as distance_population_model:

    population = pm.MutableData("log_population", KLINE.logpop.values.astype(float))
    CULTURE_ID_ = pm.MutableData("CULTURE_ID", CULTURE_ID)

    # Priors
    alpha_bar = pm.Exponential("alpha_bar", 1)
    gamma = pm.Exponential("gamma", 1)
    beta = pm.Exponential("beta", 1)
    eta_squared = pm.Exponential("eta_squared", 2)
    rho_squared = pm.Exponential("rho_squared", 0.5)

    # Gaussian Process
    mu = pm.math.zeros(ISLAND_COUNT)
    cov = (eta_squared * pm.math.exp( -(rho_squared * ISLAND_DISTANCES ** 2) )
           + (pt.eye(ISLAND_COUNT) * 1e-2))
    alpha = pm.MvNormal('alpha', mu = mu, cov = cov, dims = 'culture')

    # Likelihood
    lambda_T = (alpha_bar / gamma * population[CULTURE_ID_] ** beta) * pm.math.exp(
        alpha[CULTURE_ID_]
    )
    pm.Poisson("T", lambda_T, observed=TOOLS, dims="culture")

  1. Do you know why I’m getting the error?
  2. Is it okay for me to fix the error by adding any constant c to each element in the diagonal of the covariance matrix?

Yup, it looks like it’s not doing it right. I left a comment at the spot.

It really should be PSD. The reason a small diagonal term is added is that, while the kernel may be positive definite on paper, numerically it’s off by a tiny bit. The added bit on the diagonal makes it PSD on the computer too. If you have to add a larger value, that nearly always means something is wrong. The magic number that nearly every GP library uses is 1e-6.

Did you modify ISLAND_DISTANCES at all?

I’m able to run this without issues:

import pandas as pd
from io import StringIO
import scipy as sp
import numpy as np

# pairwise island distances (in thousands of km) from the Kline data
D = pd.read_csv(StringIO(
"""Malekula;Tikopia;Santa Cruz;Yap;Lau Fiji;Trobriand;Chuuk;Manus;Tonga;Hawaii
0.0;0.475;0.631;4.363;1.234;2.036;3.178;2.794;1.86;5.678
0.475;0.0;0.315;4.173;1.236;2.007;2.877;2.67;1.965;5.283
0.631;0.315;0.0;3.859;1.55;1.708;2.588;2.356;2.279;5.401
4.363;4.173;3.859;0.0;5.391;2.462;1.555;1.616;6.136;7.178
1.234;1.236;1.55;5.391;0.0;3.219;4.027;3.906;0.763;4.884
2.036;2.007;1.708;2.462;3.219;0.0;1.801;0.85;3.893;6.653
3.178;2.877;2.588;1.555;4.027;1.801;0.0;1.213;4.789;5.787
2.794;2.67;2.356;1.616;3.906;0.85;1.213;0.0;4.622;6.722
1.86;1.965;2.279;6.136;0.763;3.893;4.789;4.622;0.0;5.037
5.678;5.283;5.401;7.178;4.884;6.653;5.787;6.722;5.037;0.0"""), sep=";").values

# exponentiated-quadratic kernel built directly from the distances; the standard
# 1e-6 jitter on the diagonal is enough for the Cholesky factorization to succeed
K = np.exp(-D**2)
sp.linalg.cholesky(K + 1e-6 * np.eye(D.shape[0]))

I would recommend we use pm.gp.cov.ExpQuad.full_from_distance to do the computation, with a note that this is off-label usage not evaluated by the FDA. Passing a distance matrix in the first place is not the “recommended” way anyway; at least this way we’re still showing where the covariance functions are in the GP library.


I see. I didn’t modify the distances data, but I wonder if it’s either because:

  1. I’m using the covariance matrix instead of the Cholesky decomposition, or
  2. I’m not calculating the covariance matrix correctly (I’m doing it manually, based on what I saw from the PyMC source code… It’s highly likely that I made some mistake while trying to interpret it :sweat_smile:).

Could you show me how you’re then using the decomposition to build the GP?