GP-LVM in pymc3

Is there a way to implement the GP-LVM in pymc3?

So these are essentially GPs in an unsupervised setting: given only high-dimensional Y data, one needs to learn both the latent X and the kernel hyperparameters, either by maximising the marginal likelihood or with fully Bayesian inference.


Not that I know of, but maybe someone's tried it. So instead of fixed data for X, you'd just put a prior on it, right? Unless I'm missing something, I think other than that the code would work as normal. I'd be super curious to know if this works well.

I tried this with the sparse GPs in PyMC3 and it worked well. Here's the paper as well as the code. Note that the paper doesn't discuss the GP directly but uses a CAR model, which is a special case of a GP. In follow-up work I found that the GP version worked just as well. I'm happy to talk more about this.


Hi ckrapu, is it the fact that the sparse GP implementation does not make use of any location covariates that makes it a latent model? You still pass it the location values, right? I am also confused as to why you set y=None. It is the X values that are unknown in GPLVMs, right?

My understanding is that the “latent” in GPLVM refers to the value of the Gaussian process, not to the coordinates of the indexing points (the “x”). I do indeed pass it the location values.

Edit: I was wrong about the GPLVM. I was unaware that “GPLVM” does indeed refer to the case where the observed data’s coordinates are unknown.

I tried again today but I still did not manage to figure out how it works. Since there is no other resource or discussion on this forum about the GPLVM in pymc3, it would be great to have a self-contained example!

Here is my humble attempt, keeping in mind I don't know much about anything. Essentially X is now a latent variable with a diagonal MvNormal prior. Then all I do is train a separate GP per dimension of the iris dataset. Even though the GPs are independent, they share the same latent X, so my understanding is that this enforces a kind of dependency between them. I don't know if this is what Lawrence 2004 meant by "The marginalised likelihood we are optimising in (1) is recognised as the product of independent Gaussian processes…". Does that imply we need to multiply them somehow? (Though I suppose adding several marginal_likelihood terms to one model already multiplies them, since PyMC3 sums their log-probabilities.) Anyways:

import numpy as np
import pymc3 as pm
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
Y = StandardScaler().fit_transform(

N, P = Y.shape

with pm.Model() as model:
    x = pm.MvNormal('x', np.zeros(2), np.eye(2), shape=(N, 2))
    for i in range(4):
        cov_func = pm.gp.cov.Linear(2, 0)
        gp = pm.gp.Marginal(cov_func=cov_func)
        y = gp.marginal_likelihood(f'y{i}', X=x, y=Y[:, i], noise=0.01)
#     trace = pm.sample(1000, chains=1)
    inference = pm.ADVI()
    approx = pm.fit(method=inference)
    # draw posterior samples from the fitted approximation for plotting
    trace = approx.sample(1000)

HMC takes way too long, so I use ADVI instead, which seems to converge judging by the ELBO trace.

I use a linear kernel because, according to Lawrence 2004, this makes the model equivalent to PCA. If we compare it to sklearn's PCA, the result looks the same up to a rotation:

import matplotlib.pyplot as plt
plt.scatter(trace['x'].mean(0)[:, 0], trace['x'].mean(0)[:, 1], c=iris.target)


from sklearn.decomposition import PCA
pcs = PCA().fit_transform(Y)
plt.scatter(pcs[:, 0], pcs[:, 1], c=iris.target)
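One way to check "the same up to a rotation" numerically is an orthogonal Procrustes fit: if the GP-LVM latents and the PCs really differ only by a rotation, the fit should map one onto the other almost exactly. A self-contained sketch, with synthetic points standing in for the two embeddings:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
A = rng.standard_normal((150, 2))            # stand-in for the posterior-mean latents
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
B = A @ R_true                               # stand-in for the first two PCs

# find the orthogonal matrix that best maps A onto B
R, _ = orthogonal_procrustes(A, B)
err = np.linalg.norm(A @ R - B)              # ~0 if the two differ only by a rotation
```

On real output, A would be `trace['x'].mean(0)` and B the first two columns of `pcs`; a small residual after alignment confirms the embeddings agree.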


For reference, I also tried implementing PCA in PyMC3 by finding both X and W. HMC is fast here and the results look OK, but there is label switching; this has been addressed in this notebook. I do wonder why the GP implementation works better. I guess we don't enforce an orthogonality constraint on W? (But why does it still work then?)

import theano.tensor as tt
with pm.Model() as model:
    x = pm.MvNormal('x', np.zeros(2), np.eye(2), shape=(N, 2))
    w = pm.Normal('w', shape=(P, 2))
    y = pm.MvNormal('y',, w.T), cov=np.eye(P), observed=Y)
    trace = pm.sample(1000, chains=1)
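The label switching makes sense because this linear model is only identified up to a rotation: for any orthogonal R, replacing X with XR and W with WR leaves the mean XWᵀ unchanged, so the posterior has a continuum of equivalent modes. A quick numpy check of that invariance:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((150, 2))
W = rng.standard_normal((4, 2))

theta = 1.2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # an arbitrary rotation

# rotating both X and W leaves the likelihood's mean X @ W.T unchanged,
# because (X R)(W R)^T = X R R^T W^T = X W^T
same = np.allclose(X @ W.T, (X @ R) @ (W @ R).T)
```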


Finally, I also tried a nonlinear kernel, just replacing the covariance function with cov_func = pm.gp.cov.ExpQuad(2, ls=0.1). The result doesn't seem much different:

I expected better separation. I also tried it with the GPLVM implementation in GPy, which looks different:

import GPy
gplvm = GPy.models.GPLVM(Y, 2, kernel=GPy.kern.RBF(2))
gplvm.kern.lengthscale = .1
gplvm.kern.variance = 1
gplvm.likelihood.variance = 1.
gplvm.optimize(messages=1, max_iters=5e4)


However, if I use their Bayesian implementation we get a similar result. Maybe it's a variational inference thing?

m = GPy.models.bayesian_gplvm_minibatch.BayesianGPLVMMiniBatch(Y, 2, num_inducing=30)
m.optimize(messages=1, max_iters=5e3)


Any comments would be appreciated! @bwengals @ckrapu


Apologies for my earlier comments - I was wrong about the GPLVM and have enjoyed learning more about it. It is indeed as you said with latent X and observed Y, rather than the other way around. Unfortunately, I don’t have any additional insight regarding differences in implementation between PyMC3 and other software.


Sorry if I am late to the party, but I tried implementing GPLVMs in PyMC4 and found some interesting results:

Maybe it's because the length scale of the kernel is very small. That leads to very little covariance between points, and the resulting function can end up mimicking the linear kernel. I tried larger length scales and amplitudes in PyMC4 and the results change. It gives me the following separation (which looks quite close to sklearn's KernelPCA model):


KernelPCA Results (with RBF kernel):