Need for a review of my GP tutorial

Hi,

I created a multi-output Gaussian process tutorial with GeoPandas PyMC3. There might be some points to correct or adjust. Any comment before I share it to the world?

5 Likes

I think that’s a really nice integrated example. The only thing that I would like to see added is a map showing the predictive variance after fitting the GP. That way, you can show how the uncertainty is low close to the observed data locations and grows as we move away from those points.

2 Likes

Good suggestion. I pushed the changes. Thanks!

Hi @essicolo, thank you for sharing.

This is a very nice example of the intrinsic model of coregionalization (ICM) using Pymc.

In the model, I am not sure if * is a Kronecker product (⊗) ?

    ## Combined covariance
    cov_f = coreg * cov_feature # (B * K) or (B⊗K)?

There is a pymc.math.kron_dot function in pymc, but I am not sure it will work for this case.

Hi @DanhPhan,

Thanks for the review. I tried to modify the code either with pymc.math.kron_dot and pymc.gp.cov.Kron. The former needs a parameter m and returns a numpy array, so I guess, for a Kronecker product of kernels, the later should be used. But using it led to IndexError: index 1 is out of bounds for axis 1 with size 0, probably an error related to the size of the kernel inflated by the Kronecker product?

Hi @essicolo, it seems that the current pymc.gp.cov.Coregion does not support multi-outputs yet. The current implementation does not have the output_dim parameter compared to GPy Coregionalize

As the multitasks GP seems not fully supported in pymc yet, you may also try other libraries like Gpy, GPflow (tensorflow backend), GPytorch (pytorch backend), or recently mogptk (pytorch backend).

Here is my simple implementation using GPytorch with your data set. Although, I don’t like it as we need to write more boilerplate codes.

1 Like

Hi, Thanks for the examples! Since multi-output is not yet built in PyMC, I rebooted the tutorial with GPFlow (I’m not so good with Pytorch).

1 Like

@essicolo and @DanhPhan, you can indeed use PyMC for multi-output GPs. Basically you just need to add an extra input indicating which output you’re observing/predicting. See the small example here: Coregionalization model for two separable multidimensional Gaussian Process - #4 by bwengals. I’ve built it into my package for building GPs from tabular data, you can see it in action here and the code here.

2 Likes

Hi @BioGoertz, which multi-output GPs methods currently support in your package?

The multi-output GPs seems not fully supported in pymc yet, for example, we are not able to mix different kernels (like ExpQuad and Matern32) together in the Coregion regression model.

My package is just using Pymc3 as the only backend (for now, hope to expand it to GPFlow), so it can’t really do anything that Pymc3 can’t. I just implemented the bwengels example there (linear model of coregionalization, I believe?), but you’re right that it’s not immediately obvious how/whether you could use structurally distinct but correlated kernels for the two outputs.

@essicolo For your priors on lengthscale and variance, I might suggest using zero-avoiding priors (Gamma or InvGamma) as recommended here: Robust Gaussian Processes in Stan. That should avoid some of the issues you saw with HalfCauchy. You can also take a principled approach to defining them by scaling them to the observed quantiles of your input and output data.

As for the “meaning” of the W matrix, I’d suggest watching this: Multi Output Gaussian Processes, Mauricio Alvarez - YouTube and/or reading this: http://gpss.cc/gpss17/slides/multipleOutputGPs.pdf. As far as I understand it (and I’m no expert), coregionalization uses a single latent GP and considers the different outputs to be linear combinations of different samples from this GP. The rank of W is essentially the number of distinct samples that are combined in this way, so (I think) higher rank → more samples → more idiosyncratic behavior between the outputs.

Also, I’d recommend using a different color palette for the uncertainty vs the concentration.

It’s a great notebook!

2 Likes

@BioGoertz Thanks for the tips and links. I will update everything as soon as I have some time, and maybe try running the model with Gumbi!