Gaussian process lengthscale parameterization / trigger of compound sampler

I’ve been toying around with the GP module, and have 2 separate questions that I hope I could get help on.

In the docs, the RBF kernel is described as having the following parameterization:

k(x,x') = exp[-\frac{(x - x')^2}{2 l^2}]

I would read that as "compute the squared Euclidean distance, divide by 2l^2, negate, and exponentiate". However, that's not how it is computed. Instead, the division by the lengthscale parameter happens before the Euclidean distance calculation. Assuming x is D-dimensional, the expression for the underlying calculation in pyMC would be

k(x, x') = \exp \left(- \frac{1}{2} \sum_{d=1}^{D} \left( \frac{x_d - x'_d}{l} \right)^2 \right)

Here’s a quick numeric illustration.

import numpy as np
import pymc as pm

obs = np.array([[0, 0], [1, 1]])  # Euclidean distance between the 2 points is sqrt(2)
ls = 2
pm.gp.cov.ExpQuad(2, ls=ls)(obs).eval().round(2)  # off-diagonal value: 0.78

# replicating manually
diff = obs[0, :] - obs[1, :]
np.exp(-0.5 * ((diff / ls) @ (diff / ls)))  # evaluates to 0.78

# implementation based on the original formula
np.exp(-0.5 * (diff @ diff) / ls)  # evaluates to 0.61

# given the squared distance is 2 and l^2 was chosen to be 2 as well,
# this should be equivalent to np.exp(-0.5)
np.exp(-0.5)  # evaluates to 0.61

So, my first question is: is that intentional? It seems that the GPJax implementation is the same as pyMC's. At the same time, Stan's guide uses a more explicit notation in its formulas, and that notation corresponds to the formula in the pyMC docs, not to the implementation.

pyMC's implementation looks almost like an SE-ARD kernel (see, e.g., p. 5 of this document), where each dimension gets its own lengthscale parameter:

\operatorname{SE}-\operatorname{ARD}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\prod_{d=1}^D \sigma_d^2 \exp \left(-\frac{1}{2} \frac{\left(x_d-x_d^{\prime}\right)^2}{\ell_d^2}\right)=\sigma_f^2 \exp \left(-\frac{1}{2} \sum_{d=1}^D \frac{\left(x_d-x_d^{\prime}\right)^2}{\ell_d^2}\right)

There's one difference - in pyMC's version, the division by the lengthscale happens inside the distance calculation. Also, I understand that the point of the SE-ARD kernel is to have a different lengthscale for each dimension - which pyMC allows by passing a vector of lengthscales to the RBF kernel, but it's not a requirement. Was the design choice behind pyMC's parameterization to enable seamless SE-ARD kernel usage? Either way, I think it would be great to understand what was intended - and I am happy to contribute to the docs to help make it clearer for others.


My second question is sort of related. I was working on a model where I wanted to use GPs to estimate latent team strength over multiple seasons. Instead of one-of-k encoding teams as indicator variables and representing them as dimensions in the dataset, I decided to fit a GP per team (so my data is a one-dimensional time variable). As pyMC's kernels do not work with 3-dimensional data, I needed to code up my ExpQuad kernel equivalent, where all operations are implemented in a broadcastable manner. Here's the relevant piece of the code:

import numpy as np
import pymc as pm
import pytensor as pt


class ExpQuad3D(pm.gp.cov.Stationary):
    def __init__(self, ls=None, ls_inv=None, active_dims=None, eta=1):
        # 1-D (time) input per team; eta scales the whole covariance
        super().__init__(input_dim=1, active_dims=active_dims, ls=ls, ls_inv=ls_inv)
        self.eta = pt.tensor.as_tensor_variable(eta)

    def square_dist(self, X, Xs):
        # X is 3-D, e.g. (n_teams, n_points, 1); pairwise squared distances
        # are computed independently for each leading (team) slice
        if Xs is None:
            s = pt.tensor.square(X).sum(axis=-1)
            sqd = -2 * pt.tensor.matmul(
                X, X.transpose(0, 2, 1)
            ) + pt.tensor.expand_dims(s, -2) + pt.tensor.expand_dims(s, -1)
        else:
            raise NotImplementedError
        return pt.tensor.clip(sqd, 0.0, np.inf)

    def full(self, X, Xs=None):
        return self.cov_kernel(X, Xs)

    def quad_kernel(self, X, Xs, ls):
        return pt.tensor.exp(-0.5 * self.square_dist(X, Xs) / ls)

    def cov_kernel(self, X, Xs):
        K = self.quad_kernel(X, Xs, self.ls)
        # add a small jitter to the diagonal of each team's covariance matrix
        result, _ = pt.scan(
            fn=lambda A: A + (1e-6 * pt.tensor.identity_like(A)),
            sequences=K
        )
        return pt.tensor.mul(result, self.eta)
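
For context, I call it on a stacked input of shape (n_teams, n_seasons, 1), roughly like this (shapes and names here are just illustrative):

# X_stacked: (n_teams, n_seasons, 1) array of season indices, one slice per team
cov = ExpQuad3D(ls=ls, eta=eta)
K = cov.full(X_stacked)  # -> (n_teams, n_seasons, n_seasons) stack of covariance matrices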

If I fit a model that uses this kernel with Exponential priors for ls and eta, everything works well, and I can sample it with any of the samplers (I'm using blackjax). However, the moment I adjust the parameterization to be equivalent to what pyMC's ExpQuad() does with regard to the lengthscale parameter, i.e., I change the function as follows (note how ls is now applied inside the square_dist call):

    def quad_kernel(self, X, Xs, ls):
        return pt.tensor.exp(-0.5 * self.square_dist(X / ls, Xs))

then I can no longer sample with blackjax, and I get an error that my model is not continuous. If I use pyMC's sampler, it falls back to a Compound Step, where the ls parameter is sampled with a Metropolis step and the other parameters use NUTS. There are no other differences - the priors are still Exponential. I have tried HalfNormal and HalfCauchy too, but that did not change anything.

What could be causing this behavior? I assume it may have to do with some pytensor optimizations under the hood - I can provide a more complete example with data to illustrate, too.
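
A minimal way to check which variables PyMC ends up treating as continuous (assuming `model` refers to the model built with the modified kernel) would be to print the dtype of every free variable, since as far as I understand PyMC only assigns NUTS to variables it treats as continuous:

# a non-float dtype on ls would explain both the Metropolis fallback and the blackjax error
for rv in model.free_RVs:
    print(rv.name, rv.dtype)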


CC @bwengals

First, you're right that the way it's done in PyMC and other GP libraries (for the kernel library I mostly followed the design of GPy and early GPflow, which itself was based on GPy) is meant to make ARD, i.e. one lengthscale per input dimension, work correctly.

For intuition, maybe it helps to rewrite

k(x, x') = \exp{\left[-\frac{(x - x')^2}{2\ell^2} \right]} = \exp{\left[-\frac{1}{2} \left( \frac{x - x'}{\ell}\right)^2 \right]} = \exp{\left[-\frac{1}{2} \left( \frac{x}{\ell} - \frac{x'}{\ell}\right)^2 \right]} \,.

What the lengthscale is effectively doing is rescaling your data, or you could even say "standardizing" x and x'. Say x = 0 and x' = 100, or your data ranges from 0 to 100. A lengthscale of \ell = 100 would make the distance between your rescaled x and x' equal to one. Now say x and x' are 0 and 1, or your data ranges from 0 to 1. Now a lengthscale of \ell = 1 would make the distance between these rescaled x's also equal to one.
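
Just to make that concrete with the numbers from your first post (squared distance 2, \ell = 2): plugging into the written-out docs formula \exp(-d^2 / (2\ell^2)) gives exactly the 0.78 that ExpQuad returns - the 0.61 in your snippet comes from dividing the squared distance by \ell rather than \ell^2.

import numpy as np
import pymc as pm

obs = np.array([[0.0, 0.0], [1.0, 1.0]])
ls = 2.0

pm.gp.cov.ExpQuad(2, ls=ls)(obs).eval()[0, 1]  # ~0.7788, the off-diagonal entry

sq_dist = np.sum((obs[0] - obs[1]) ** 2)       # squared Euclidean distance = 2
np.exp(-sq_dist / (2 * ls**2))                 # ~0.7788, the same number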

Then, for the ARD, or anisotropic, case, you need to do this rescaling per dimension. This is like assuming in your model that "distance" is somehow different in each direction, so each direction needs to be rescaled independently.

An example where you might want the same lengthscale for every dimension (isotropic) is spatial data, where maybe the first dimension is latitude and the second is longitude. But say you're doing something spatiotemporal: there's no reason to think your lengthscale in the time direction is going to be the same as in the two spatial directions, so you'd use a different lengthscale for time.
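
In PyMC that just amounts to passing one lengthscale per input column, something like this minimal sketch (the column order and the numeric values are placeholders; in a real model the lengthscales would be random variables):

# columns of X: (latitude, longitude, time); the two spatial dimensions share a
# lengthscale while time gets its own
cov = pm.gp.cov.ExpQuad(input_dim=3, ls=[1.0, 1.0, 0.5])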

The way the Stan docs write the kernel with \rho (which equals the lengthscale \ell) pulled out of the sum hides this a bit more, but the formula is the same.

For your 2nd Q:

I decided to fit a GP per team (so my data is a one-dimensional time variable). As pyMC’s kernels do not work with 3-dimensional data, I needed to code up my ExpQuad kernel equivalent, where all operations are implemented in a broadcastable manner.

I think you can avoid doing this in your use case. Can you think of each team’s time series as one draw from the same GP?

f_\text{team} \sim \mathcal{GP}(0, k(x, x'))

so the lengthscale in the time direction is the same for every team? Or is the lengthscale over time different per team? I think with more info on what you'd like to model we can come up with something that works, so please feel free to post a more complete example.
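
If the answer is "the same lengthscale for every team", one way to write that down without a custom 3-D kernel is to build the season-level covariance once and take one i.i.d. draw per team from it. A rough, untested sketch with hypothetical names and placeholder priors:

import numpy as np
import pymc as pm
import pytensor.tensor as ptt
from pytensor.tensor.slinalg import cholesky

n_seasons, n_teams = 20, 10                           # placeholder sizes
seasons = np.arange(n_seasons, dtype=float)[:, None]  # (n_seasons, 1)

with pm.Model() as model:
    ls = pm.InverseGamma("ls", alpha=3.0, beta=3.0)   # placeholder prior
    eta = pm.HalfNormal("eta", sigma=1.0)
    cov = eta**2 * pm.gp.cov.ExpQuad(1, ls=ls)

    K = cov(seasons) + 1e-6 * ptt.eye(n_seasons)      # one (n_seasons, n_seasons) matrix
    L = cholesky(K)

    # non-centered parameterization: each column of z becomes one team's draw
    # from the same GP over seasons
    z = pm.Normal("z", 0.0, 1.0, shape=(n_seasons, n_teams))
    alpha = pm.Deterministic("alpha", (L @ z).T)      # (n_teams, n_seasons)

(pm.MvNormal with cov=K and shape (n_teams, n_seasons) would be the centered equivalent; the non-centered version usually samples better.)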

Then, the Exponential or HalfNormal aren't great distributions for the lengthscale. An uninformative prior for the lengthscale puts lots of mass at large values. If the lengthscale is really large relative to the distances between x and x', it will scale the Euclidean distance between every pair of x's down to nearly zero. Functions drawn from this GP will be flat lines, because that's like saying all the x's are the same, so there are no changes in y. A better prior to use is an InverseGamma.
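
For example, a minimal sketch - the alpha/beta values below are placeholders, and if I remember right pm.find_constrained_prior can help pick them so that most of the prior mass covers the range of distances you actually care about:

import pymc as pm

# assumed helper: pm.find_constrained_prior searches for parameters that put
# ~95% of the prior mass between lower and upper (bounds here are placeholders)
params = pm.find_constrained_prior(
    pm.InverseGamma, lower=1.0, upper=10.0, init_guess={"alpha": 3.0, "beta": 5.0}
)

with pm.Model():
    ls = pm.InverseGamma("ls", **params)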

Thanks - the explanation on the lengthscale parameterization is really helpful!

As for my second question:

  1. I should clarify that the lengthscale prior I am using is \frac{1}{2}Exp(1) - i.e., an inverse Exponential, which is a special case of the inverse Gamma. I hadn't even thought of simply using InverseGamma, though.
  2. I am pretty sure the issue I faced has to do with pytensor/pymc rather than the GP side of things, since a simple "maths" change was enough to trigger a different sampler method - I'll see if I can reproduce an example that doesn't involve GPs to illustrate.

As for the model itself - I would appreciate advice! Basically, I am trying to estimate home-court advantage in the Lithuanian basketball league. I have all regular season games for 20+ seasons, where for each game I know the home team, away team and the result of the game.

The GP-based model I came up with is as follows:

  • Each team's strength in season s is estimated via a draw from a GP fit for that team
  • Each team has a home-court advantage that's static across seasons; those are drawn in a hierarchical setup to provide a partial-pooling structure
  • The outcome variable is the score differential (home points - away points), which is modeled as a function of the home team's strength, the away team's strength, and the home team's home-court advantage.

In maths notation:

O^{s,k}_{i,j} \sim \text{Normal}(\alpha^{s}_{i} - \alpha^{s}_{j} + \beta_i H^{k}_i - \beta_j H^{k}_j,\ \epsilon^{s}) \\
\alpha^{s}_{t} \sim \text{GP}_t(0, f(s, l_t, \eta_t^{2})) \\
l_t \sim \frac{1}{2}\text{Exp}(1) \\
\eta_t^{2} \sim \text{Exp}(1) \\
\bar{\beta} \sim \text{Normal}(4, 2) \\
\gamma \sim \text{HalfNormal}(2) \\
\beta_{t} \sim \text{Normal}(\bar{\beta}, \gamma) \\
\epsilon^{s} \sim \text{HalfCauchy}(5)

where O^{s,k}_{i,j} is the score differential of game k in season s between teams i and j, \alpha^{s}_{i} - \alpha^{s}_{j} is the pair-wise strength differential between teams i and j in season s, and H^{k}_{t} is an indicator variable for whether team t plays game k at home. The main parameters of interest are the \beta_{t}.
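
For concreteness, the likelihood line in my PyMC code looks roughly like this (a simplified sketch with hypothetical index arrays; since H is 1 for the home team and 0 for the away team, only the home team's \beta survives):

# home_idx, away_idx: team index of each game; season_idx: season of each game
# alpha: (n_teams, n_seasons) strengths, beta: (n_teams,) home-court advantages
mu = (
    alpha[home_idx, season_idx]
    - alpha[away_idx, season_idx]
    + beta[home_idx]
)
pm.Normal("score_diff", mu=mu, sigma=eps, observed=observed_diff)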

This model works and samples well, but because I fit t GPs, I needed to go the 3-D route (and ran into the issue described above when my lengthscale parameterization divides x before taking the distance rather than afterwards).

How would you suggest changing it to a single GP?

I have considered an approach where I add team indicators as a one-of-k encoded vector to each season (so instead of 1-dimensional data, I would have 1 + t dimensional data). When I tried running it, it took much longer to sample than I expected (as in, I did not wait for it to finish…), while the t-GPs approach samples in a few minutes. I have ~30 teams and ~30 seasons in my data, with a total of ~4800 games (but the GPs work on season-level data, not game-level data).

This model is just for my own learning, and I just published a full notebook that's also a blog post - in case you want to look through it in more detail (or learn about basketball team strengths and home-court advantage levels in Lithuania! :slight_smile: ). But I'll be happy to do a follow-up with some other specifications!
