I’ve been toying around with the GP module and have two separate questions that I hope I can get help with.
In the docs, the RBF kernel is described as having the following parameterization:
k(x, x') = \exp\left[-\frac{(x - x')^2}{2 l^2}\right]
I would read that as “compute the squared Euclidean distance, then divide by 2l^2 and exponentiate with a negative sign”. However, that’s not how it is computed. Instead, the division by the lengthscale parameter happens before the Euclidean distance calculation. Assuming x is D-dimensional, the expression for the underlying calculation in pyMC would be
k(x, x') = \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \left(\frac{x_d - x'_d}{l}\right)^2\right)
Here’s a quick numeric illustration.
import numpy as np
import pymc as pm

obs = np.array([[0, 0], [1, 1]])  # Euclidean distance between the 2 points is sqrt(2)
ls = 2
pm.gp.cov.ExpQuad(2, ls=ls)(obs).eval().round(2)  # non-diagonal value: 0.78

# replicating manually
np.exp(-0.5 * (((obs[0, :] - obs[1, :]) / ls) @ ((obs[0, :] - obs[1, :]) / ls)))  # evaluates to 0.78

# implementation based on the original formula
np.exp(-0.5 * ((obs[0, :] - obs[1, :]) @ (obs[0, :] - obs[1, :]).T) / ls)  # evaluates to 0.61

# given that the squared distance is 2 and l^2 was chosen to be 2 as well,
# this should be equivalent to np.exp(-0.5)
np.exp(-0.5)  # evaluates to 0.61
So, my first question is: is this intentional? GPJax’s implementation seems to be the same as pyMC’s. At the same time, Stan’s guide uses a more explicit notation in its formulas, and it corresponds to the formula in the pyMC docs, not to the implementation.
pyMC’s implementation looks almost like an SE-ARD kernel (see, e.g., p. 5 of this document), where each dimension gets its own lengthscale parameter:
\operatorname{SE}-\operatorname{ARD}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\prod_{d=1}^D \sigma_d^2 \exp \left(-\frac{1}{2} \frac{\left(x_d-x_d^{\prime}\right)^2}{\ell_d^2}\right)=\sigma_f^2 \exp \left(-\frac{1}{2} \sum_{d=1}^D \frac{\left(x_d-x_d^{\prime}\right)^2}{\ell_d^2}\right)
There’s one difference - in pyMC’s version, the lengthscale gets applied inside the distance calculation. Also, I understand that the point of the SE-ARD kernel is to have a different lengthscale for each dimension - pyMC allows passing a vector of lengthscales to the RBF kernel (see the snippet below), but it’s not a requirement. Was the design choice behind pyMC’s parameterization to enable seamless SE-ARD kernel usage? Either way, I think it would be great to understand what was intended - and I am happy to contribute to the docs to help make it clearer for others.
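For reference, this is the kind of ARD-style usage I mean - per-dimension lengthscales passed directly to the RBF kernel (the lengthscale values here are arbitrary, purely for illustration):

obs = np.array([[0.0, 0.0], [1.0, 1.0]])
# one lengthscale per input dimension
ard_cov = pm.gp.cov.ExpQuad(input_dim=2, ls=np.array([1.0, 3.0]))
ard_cov(obs).eval().round(2)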
My second question is sort of related. I was working on a model where I wanted to use GPs to estimate latent team strength over multiple seasons. Instead of one-hot encoding teams as indicator variables and representing them as dimensions in the dataset, I decided to fit a GP per team (so my data is a one-dimensional time variable). As pyMC’s kernels do not work with 3-dimensional data, I needed to code up my own ExpQuad kernel equivalent, where all operations are implemented in a broadcastable manner. Here’s the relevant piece of the code:
import pytensor as pt  # np and pm are imported above

class ExpQuad3D(pm.gp.cov.Stationary):
    def __init__(self, ls=None, ls_inv=None, active_dims=None, eta=1):
        super().__init__(input_dim=1, active_dims=active_dims, ls=ls, ls_inv=ls_inv)
        self.eta = pt.tensor.as_tensor_variable(eta)

    def square_dist(self, X, Xs):
        # batched squared Euclidean distance, X has shape (n_teams, n_obs, 1):
        # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * x_i . x_j
        if Xs is None:
            s = pt.tensor.square(X).sum(axis=-1)
            sqd = (
                -2 * pt.tensor.matmul(X, X.transpose(0, 2, 1))
                + pt.tensor.expand_dims(s, -2)
                + pt.tensor.expand_dims(s, -1)
            )
        else:
            raise NotImplementedError
        return pt.tensor.clip(sqd, 0.0, np.inf)

    def full(self, X, Xs=None):
        return self.cov_kernel(X, Xs)

    def quad_kernel(self, X, Xs, ls):
        return pt.tensor.exp(-0.5 * self.square_dist(X, Xs) / ls)

    def cov_kernel(self, X, Xs):
        K = self.quad_kernel(X, Xs, self.ls)
        # add a small jitter to the diagonal of each team's covariance matrix
        result, _ = pt.scan(
            fn=lambda A: A + (1e-6 * pt.tensor.identity_like(A)),
            sequences=K,
        )
        return pt.tensor.mul(result, self.eta)
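For context, here is roughly how I evaluate the kernel - the layout (one slice per team, a single time column) and the numbers are just illustrative:

n_teams, n_seasons = 3, 5
# season index per team, shape (n_teams, n_seasons, 1)
t = np.tile(np.arange(n_seasons, dtype="float64")[:, None], (n_teams, 1, 1))

cov = ExpQuad3D(ls=1.5, eta=1.0)
K = cov.full(t).eval()  # shape (n_teams, n_seasons, n_seasons): one covariance matrix per team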
If I fit a model that uses this kernel with Exponential priors for ls and eta, everything works well, and I can sample it with any of the samplers (I’m using blackjax). However, the moment I adjust the parameterization to be equivalent to what pyMC’s ExpQuad() does with regard to the lengthscale parameter, i.e., I change the function as follows (note how ls is now applied inside square_dist):
    def quad_kernel(self, X, Xs, ls):
        return pt.tensor.exp(-0.5 * self.square_dist(X / ls, Xs))
then I can no longer sample with blackjax, and I get an error that my model is not continuous. If I use pyMC’s sampler, it uses a Compound Step, where the ls parameter is sampled with a Metropolis step and the other parameters use NUTS. There are no other differences - the priors still use an Exponential distribution. I have tried HalfNormal and HalfCauchy too, but that did not change anything.
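For reference, this is roughly how I’m invoking the samplers (model here stands in for my actual model):

with model:
    # blackjax via the nuts_sampler argument - this is the call that raises the "not continuous" error
    idata = pm.sample(nuts_sampler="blackjax")

with model:
    # pyMC's default sampler - this runs, but assigns a CompoundStep (Metropolis for ls, NUTS for the rest)
    idata = pm.sample()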
What could be causing this behavior? I assume it may have to do with some pytensor optimizations under the hood - I can provide a more complete example with data to illustrate, too.
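In case it helps, here is a quick sanity check (with made-up numbers) that the modified quad_kernel reproduces pm.gp.cov.ExpQuad numerically on 1-D inputs - so I would expect the difference to be in how the graph is built rather than in the values:

# ExpQuad3D here uses the modified quad_kernel from above
x = np.arange(5, dtype="float64")[:, None]                       # shape (5, 1)
K_pm = pm.gp.cov.ExpQuad(1, ls=1.5)(x).eval()
K_3d = ExpQuad3D(ls=1.5, eta=1.0).full(x[None, :, :]).eval()[0]  # add a batch axis, then drop it
np.allclose(K_pm, K_3d - 1e-6 * np.eye(5))                       # True, up to the diagonal jitter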