GSoC 2022 Multi-outputs GPs

Hi everyone, I am Danh Phan, a PhD candidate at Monash University, Australia. My research focuses on machine learning (Bayesian methods, choice models, tree-based models, and NNs) for intelligent transport systems. I have recently worked on multi-output Gaussian processes (GPs) for generating people’s travel activity times.

I have used PyMC for a while and have submitted a couple of PRs to pymc and aesara, thanks to the great support from @pymc_devs, especially @ricardoV94, @twiecki, and @brandonwillard.

I would love to work on the Multi-output Gaussian Processes in PyMC, with the potential mentors @bwengals and @fonnesbeck.

This project aims to extend the pymc.gp module to support multi-output GPs. The tentative main tasks are:

  • Implement a MultitaskMean class that works well with the current Mean classes in pymc.gp.mean.
  • Upgrade the Coregionalization kernel (we already have a Coregion kernel in pymc.gp.cov, but it may need to be improved to accommodate multiple outputs) to support the Intrinsic Coregionalization Model (ICM), the Semiparametric Latent Factor Model (SLFM), and the Linear Model of Coregionalization (LMC).
  • Implement an Index kernel, and then a Hadamard kernel that wraps any kernel (such as ExpQuad (RBF), Matern32, …) with the Index kernel.
  • Write tests for these new classes, including tests of their compatibility with the other Mean and Cov classes and functions in the pymc.gp module.
  • Write documentation and examples for multi-output GPs.
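To make the Index and Hadamard kernel idea above concrete, here is a minimal NumPy sketch. It is not PyMC or GPyTorch code; the helper names `exp_quad`, `index_kernel`, and `hadamard_kernel` are hypothetical, and the parameterisation B = W Wᵀ + diag(kappa) is assumed here as the usual ICM-style coregionalization matrix. Each observation carries an input `x` and an integer output index, and the covariance between two observations is the input kernel multiplied elementwise by the matching entry of B.

```python
import numpy as np


def exp_quad(x1, x2, ls=1.0):
    """Squared-exponential (ExpQuad / RBF) kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)


def index_kernel(idx1, idx2, W, kappa):
    """Look up B[i, j] for pairs of output indices, with B = W W^T + diag(kappa)."""
    B = W @ W.T + np.diag(kappa)
    return B[np.ix_(idx1, idx2)]


def hadamard_kernel(x1, idx1, x2, idx2, W, kappa, ls=1.0):
    """Elementwise (Hadamard) product of an input kernel and the index kernel."""
    return exp_quad(x1, x2, ls) * index_kernel(idx1, idx2, W, kappa)


# Illustrative data (made up): 3 points observed for output 0, 2 for output 1.
rng = np.random.default_rng(0)
x = np.array([0.0, 0.5, 1.0, 0.2, 0.9])
idx = np.array([0, 0, 0, 1, 1])

W = rng.normal(size=(2, 1))    # rank-1 mixing weights, one row per output
kappa = np.array([0.1, 0.2])   # output-specific variances

K = hadamard_kernel(x, idx, x, idx, W, kappa)
print(K.shape)  # (5, 5)
```

Because each output only contributes rows for the inputs where it was actually observed, this formulation does not require all outputs to share the same input locations.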

I plan to translate some code from existing GP libraries, especially GPy and GPyTorch for the Coregion kernel, and GPyTorch for the Hadamard kernel. I have worked with PyTorch for a couple of years, so I am quite familiar with the GPyTorch code. I will write a detailed proposal and share it in this thread soon.

Thank you all!

7 Likes

Thanks for this! We look forward to seeing the proposal.

3 Likes

Hi @fonnesbeck and @bwengals,

I have made a draft of the proposal in this link. Please give me some suggestions at your convenience.

Also, regarding the Hadamard model, I am looking for a real example data set for this model, but have not been able to find one. The example in GPyTorch only uses synthetic data, and I am not sure which paper this model is based on.

When reading some papers, I saw that the Convolutional Process (CONV) model is quite popular and has recently been implemented in the mogptk library. So, should we implement CONV instead of the Hadamard model? Please advise.

I am also open to any ideas from all members of our PyMC community. Please let me know your opinions on this project.

Thank you.

2 Likes

I’d be interested in following the progress of this project and aiding in discussions as I will be a keen end-user.

In terms of datasets, it would be good to see some spatiotemporal models with multivariate outcomes. Some examples are the multi-infectious disease time series from Project Tycho used in Flaxman 2015 and the continuous-space multi-crime dataset used in Aglietti 2019. There are plenty of time-series examples from finance too.

I’d focus on the most general and flexible solvers, which can be easily incorporated into a wider pymc model, rather than super-specific solutions. Even showing how to write a multi-output GP with a matrix mean function (new col for each output), or as a GP stacking each of the outcomes with a Kronecker product representing the correlation between each of the outputs, would be extremely valuable.
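As a rough illustration of the stacked formulation described here, the following NumPy sketch (not PyMC code; all names and numbers are made up for the example) builds the covariance of the stacked outputs as a Kronecker product of an assumed between-output matrix B and an input kernel matrix Kx, then draws one sample with a matrix mean that has one column per output. It assumes every output is observed at the same inputs.

```python
import numpy as np


def exp_quad(x1, x2, ls=1.0):
    """Squared-exponential (ExpQuad / RBF) kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)


rng = np.random.default_rng(1)
n, p = 4, 3                       # n shared inputs, p outputs
x = np.linspace(0.0, 1.0, n)

Kx = exp_quad(x, x, ls=0.3)       # (n, n) covariance over inputs
W = rng.normal(size=(p, 2))
B = W @ W.T + 0.1 * np.eye(p)     # (p, p) covariance between outputs

# Covariance of the stacked vector [f_1(x), f_2(x), ..., f_p(x)].
K_full = np.kron(B, Kx)           # (p * n, p * n)

# Matrix mean: one column per output, constant along the inputs.
mean = np.tile(np.array([0.0, 1.0, -1.0]), (n, 1))   # (n, p)

# Draw one sample via a Cholesky factor (small jitter for stability).
L = np.linalg.cholesky(K_full + 1e-8 * np.eye(n * p))
f_stacked = mean.T.ravel() + L @ rng.normal(size=n * p)

# Undo the stacking: back to one column per output.
F = f_stacked.reshape(p, n).T     # (n, p)
print(F.shape)  # (4, 3)
```

Block (i, j) of `K_full` equals `B[i, j] * Kx`, so B scales how strongly outputs i and j co-vary while Kx carries the dependence on the inputs.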

3 Likes

Thank you @theo for the suggestion.

Sure, multi-output GP coregionalization regression with a Kronecker product will be added and/or improved in PyMC. This is the main task in phase 1 of the project proposal.

Also thanks for the datasets, I am reading and collecting more data and examples/applications of multi-output GPs, and this will help.

1 Like

Yup, agree with @theo. PyMC’s GPs won’t ever be better or more exhaustive than GPy/GPyTorch/GPflow, and I think that’s not the goal. What we really want is to implement the stuff that fits well with MCMC inference, that has or enables lots of practical applications, and that makes sense embedded in larger probabilistic models. So at least for me, it’s all about what would get used the most as opposed to what’s in the latest paper. Working out what has the most use cases is the hard part, and is really a judgement call.

So, for your specific question about CONV vs Hadamard, I’m not sure. I’m also not familiar with CONV (not to disagree with how popular it may be!), will def check it out.

1 Like

Have a look at the tinygp library for a good example of a gp library that is extensible and usable within ppl models with really clear code. The main problem I find with GPyTorch and GPFlow is that they are extremely rigid and impossible to hack/add as a component to a larger model. The tinygp library should give you a few ideas of how you might develop pymc’s gp functionality. The implementation of quasiseparable kernels is a simple extension for speeding up 1D GPs (common).

2 Likes

Hi @bwengals, thank you for the suggestion.

Yes, I would like to implement a method that suits PyMC’s MCMC inference and that has or enables lots of practical applications for PyMC’s users. Multi-output coregionalization regression should be a popular and useful one.

I am also not sure how popular the Convolutional Process (CONV) model is in practice. I just wanted to raise it here since it is mentioned in Section 5.3 of this review paper, and also in these lecture slides (p33-53).

For the Hadamard model, there is a notebook example in GPyTorch, but no reference to a related paper. I found one paper here, but I am not sure whether it is the right paper for the Hadamard model. Please advise!

1 Like

Thanks @theo for the info on the tinygp package. I see it uses JAX as a backend, and I will definitely look into the details of its source code.

Also, could you please elaborate more and give an example on this point:

The main problem I find with GPyTorch and GPFlow is that they are extremely rigid and impossible to hack/add as a component to a larger model.

Sure.

  1. I would say GPyTorch and GPFlow are packages for GP-only models. But oftentimes, you want a GP as small part of a much larger model. There is some integration between GPyTorch and the Pyro PPL, but it is very difficult to use. You want a GP framework which is easy to use and modular, and that won’t break other parts of the PPL when implemented.
  2. You might be developing a method for your really specific research case. Or you might be creating a new GP method. You want a GP framework where the source code is clear and you can easily extend/build on top of without completely starting afresh. GPFlow and GPyTorch are not this.

2 Likes

Thank you @theo and @bwengals for the suggestion.

After reading more on the topic, I agree that implementing the Hadamard kernel is a good option, which can be extended later for other applications.

I have updated my proposal at this link, mostly in the Hadamard and Schedule sections. I have also added my recent PRs.

I have also submitted my proposal to GSoC at this link. If you think anything in the proposal needs to be changed, please let me know @fonnesbeck @bwengals.

Thank you all 🙂

4 Likes

Thanks for submitting!!

1 Like