GSoC 2022 Multi-outputs GPs

DanhPhan · March 20, 2022, 10:52am

Hi everyone, I am Danh Phan, a PhD candidate at Monash University, Australia. My research focuses on machine learning (Bayesian methods, choice models, tree-based models, and NNs) for intelligent transport systems. I have recently worked on multi-output Gaussian Processes (GPs) for generating people’s travel activity time.

I have used PyMC for a while, and have submitted a couple of PRs to pymc and aesara, thanks to the great support from @pymc_devs, especially @ricardoV94, @twiecki and @brandonwillard.

I would love to work on the Multi-output Gaussian Processes in PyMC, with the potential mentors @bwengals and @fonnesbeck.

This project aims to extend the pymc.gp module to support multi-output GPs. The tentative main tasks of this project are:

Implement a MultitaskMean class that works well with the current Mean classes in pymc.gp.mean.
Upgrade the Coregionalization kernel (we already have a Coregion kernel in pymc.gp.cov, but it may need to be improved to accommodate multi-outputs) to support the Intrinsic Coregionalization Model (ICM), the Semiparametric Latent Factor Model (SLFM), and the Linear Model of Coregionalization (LMC).
Implement an Index kernel, and then a Hadamard kernel that wraps any kernel (such as ExpQuad (RBF), Matern32, …) with the Index kernel.
Write tests for these new classes, and check their compatibility with each other and with the existing Mean and Cov classes in the pymc.gp module.
Write documentation and examples for multi-output GPs.
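To make the ICM and LMC items above concrete, here is a minimal NumPy sketch of the covariance matrices they describe. It is only an illustration under stated assumptions: all outputs are observed at the same inputs, the coregionalization matrix is parameterized as B = W Wᵀ + diag(kappa), and the helper names (`exp_quad`, `coregion_matrix`, `icm_cov`, `lmc_cov`) are hypothetical and not part of the pymc.gp API.

```python
import numpy as np

def exp_quad(x1, x2, lengthscale=1.0):
    """Squared-exponential (ExpQuad/RBF) kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def coregion_matrix(W, kappa):
    """Assumed parameterization: B = W W^T + diag(kappa), PSD by construction."""
    return W @ W.T + np.diag(kappa)

def icm_cov(x, W, kappa, lengthscale=1.0):
    """ICM: one shared input kernel, K = B kron k(x, x)."""
    return np.kron(coregion_matrix(W, kappa), exp_quad(x, x, lengthscale))

def lmc_cov(x, Ws, kappas, lengthscales):
    """LMC: a sum of ICM terms, each with its own B_q and lengthscale."""
    return sum(icm_cov(x, W, k, ls) for W, k, ls in zip(Ws, kappas, lengthscales))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)            # 5 shared input locations
n_outputs = 3
W1 = rng.normal(size=(n_outputs, 1))    # rank-1 mixing weights (illustrative values)
W2 = rng.normal(size=(n_outputs, 1))
kappa = np.full(n_outputs, 0.1)

K_icm = icm_cov(x, W1, kappa)                              # shape (15, 15)
K_lmc = lmc_cov(x, [W1, W2], [kappa, kappa], [0.2, 1.0])   # two latent kernels
```

With this output-major stacking, the block `K_icm[i*n:(i+1)*n, j*n:(j+1)*n]` is `B[i, j]` times the input kernel matrix. In this parameterization, SLFM would correspond to LMC terms with rank-one `W` and `kappa` set to zero.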

I plan to port some code from existing GP libraries, especially GPy and GPyTorch for the Coregion kernel, and GPyTorch for the Hadamard kernel. I have worked with PyTorch for a couple of years, so I’m quite familiar with the GPyTorch code. I will write a detailed proposal and share it in this thread soon.

Thank you all!

fonnesbeck · March 20, 2022, 3:44pm

Thanks for this! We look forward to seeing the proposal.

DanhPhan · March 30, 2022, 2:22am

Hi @fonnesbeck and @bwengals,

I have made a draft of the proposal at this link. Please give me some suggestions at your convenience.

Also, regarding the Hadamard model, I am looking for a real example data set for this model, but have not been able to find one. The example in GPyTorch only uses synthetic data, and I am not sure which paper the model is based on.

When reading some papers, I saw that the Convolutional Process (CONV) model is quite popular, and it has recently been implemented in the mogptk library. So, should we implement CONV instead of the Hadamard model? Please advise.

I am also open to any ideas from all members of our PyMC community. Please let me know your opinions on this project.

Thank you.

theo · April 5, 2022, 3:46pm

I’d be interested in following the progress of this project and aiding in discussions as I will be a keen end-user.

In terms of datasets, it would be good to see some spatiotemporal models with multivariate outcomes. Some examples are the multi-infectious disease time series from Project Tycho used in Flaxman 2015 and the continuous-space multi-crime dataset used in Aglietti 2019. There are plenty of time-series examples from finance too.

I’d focus on the most general and flexible solvers, which can be easily incorporated into a wider pymc model, rather than super-specific solutions. Even showing how to write a multi-output GP with a matrix mean function (a new column for each output), or as a GP stacking each of the outcomes with a Kronecker product representing the correlation between each of the outputs, would be extremely valuable.
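As a rough illustration of that last idea, the NumPy sketch below draws one sample from a two-output GP with a matrix mean (one column per output) and a Kronecker-structured covariance B ⊗ Kx, without ever forming the full matrix for sampling. The kernel, the output correlation matrix `B`, the mean, and the helper name `exp_quad` are all made-up assumptions for the example, not an existing pymc interface.

```python
import numpy as np

def exp_quad(x1, x2, lengthscale=0.3):
    """Squared-exponential kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(1)
n, p = 20, 2                                  # n inputs, p outputs
x = np.linspace(0.0, 1.0, n)

Kx = exp_quad(x, x) + 1e-6 * np.eye(n)        # input covariance plus jitter
B = np.array([[1.0, 0.8],
              [0.8, 1.0]])                    # assumed between-output covariance
mean = np.column_stack([np.zeros(n), 2.0 * x])  # matrix mean: one column per output

# Cholesky factors of the two small matrices instead of the (n*p, n*p) one.
Lx = np.linalg.cholesky(Kx)
Lb = np.linalg.cholesky(B)

Z = rng.normal(size=(n, p))                   # iid standard normal draws
F = mean + Lx @ Z @ Lb.T                      # column j is output j at all inputs

# Equivalent stacked view: vec(F) has covariance kron(B, Kx).
K_full = np.kron(B, Kx)
f_stacked = F.flatten(order="F")              # output-major stacking
```

The identity vec(Lx Z Lbᵀ) = (Lb ⊗ Lx) vec(Z) is what makes the matrix form and the stacked form agree, so the cost is driven by the n × n and p × p factors rather than the full (np) × (np) covariance.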

DanhPhan · April 6, 2022, 11:20am

Thank you @theo for the suggestion.

Sure, the multi-output GP Coregion Regression with Kronecker product will be added and/or improved in PyMC. This is the main task in phase 1 of the project’s proposal.

Also, thanks for the datasets. I am reading and collecting more data and examples/applications of multi-output GPs, and this will help.

bwengals · April 6, 2022, 6:54pm

Yup, agree with @theo, PyMC’s GPs won’t ever be better than or more exhaustive than GPy/GPyTorch/GPflow, and I think it’s not the goal, but we really want to implement the stuff that really fits well with MCMC inference, that has or enables lots of practical applications, and makes sense embedded in larger probabilistic models. So at least for me, it’s all about what would get used the most as opposed to what’s in the latest paper. Thinking about what has the most use cases is the hard one, and is really a judgement call.

So, for your specific question about CONV or CONV vs Hadamard, I’m not sure. I’m also not familiar with CONV (not to disagree with how popular it may be!), will def check it out.

theo · April 6, 2022, 7:26pm

Have a look at the tinygp library for a good example of a GP library that is extensible and usable within PPL models with really clear code. The main problem I find with GPyTorch and GPFlow is that they are extremely rigid and impossible to hack/add as a component to a larger model. The tinygp library should give you a few ideas of how you might develop pymc’s gp functionality. The implementation of quasiseparable kernels is a simple extension for speeding up 1D GPs (a common case).

DanhPhan · April 7, 2022, 11:25am

Hi @bwengals, thank you for the suggestion.

Yes, I would like to implement methods that suit PyMC’s MCMC inference, and that have or enable lots of practical applications for PyMC’s users. The multi-output Coregion Regression should be a popular and useful one.

I am also not sure how popular the Convolutional Process (CONV) model is in practice. I just want to raise it here since it was mentioned in Section 5.3 of this review paper, and also in these lecture slides (p33-53).

For the Hadamard model, there is a notebook example in GPyTorch, but no reference to a related paper. I found one paper here, but I am not sure whether it is the right paper for the Hadamard model. Please advise!
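For context, this is how I understand the Hadamard construction, written as a small NumPy sketch rather than as any library's actual implementation: each observation carries an output (task) index, and the covariance is the elementwise product of an ordinary input kernel and an index kernel that looks up entries of B = W Wᵀ + diag(kappa). The function names and parameter values below are hypothetical.

```python
import numpy as np

def exp_quad(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def index_kernel(idx1, idx2, W, kappa):
    """Look up B[i, j] for every pair of output indices, B = W W^T + diag(kappa)."""
    B = W @ W.T + np.diag(kappa)
    return B[idx1[:, None], idx2[None, :]]

def hadamard_cov(x1, idx1, x2, idx2, W, kappa, lengthscale=1.0):
    """Elementwise (Hadamard) product of the input kernel and the index kernel."""
    return exp_quad(x1, x2, lengthscale) * index_kernel(idx1, idx2, W, kappa)

# Two outputs observed at different inputs, which a plain Kronecker form cannot express.
x = np.array([0.0, 0.5, 1.0, 0.25, 0.75])
task = np.array([0, 0, 0, 1, 1])        # output index of each observation
W = np.array([[1.0], [0.5]])            # illustrative values
kappa = np.array([0.1, 0.2])

K = hadamard_cov(x, task, x, task, W, kappa)   # shape (5, 5)
```

When every output is observed at the same inputs and the data are stacked output by output, this covariance coincides with the Kronecker form `np.kron(B, Kx)`, so the Hadamard kernel can be read as the ICM generalized to outputs with different input locations.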

DanhPhan · April 7, 2022, 11:33am

Thanks @theo for the info on the tinygp package. I see it uses JAX as a backend, and I will definitely look into the details of its source code.

Also, could you please elaborate more and give an example on this point:

The main problem I find with GPyTorch and GPFlow is that they are extremely rigid and impossible to hack/add as a component to a larger model.

theo · April 7, 2022, 12:04pm

Sure.

I would say GPyTorch and GPFlow are packages for GP-only models. But oftentimes, you want a GP as a small part of a much larger model. There is some integration between GPyTorch and the Pyro PPL, but it is very difficult to use. You want a GP framework which is easy to use and modular, and that won’t break other parts of the PPL when implemented.
You might be developing a method for your really specific research case. Or you might be creating a new GP method. You want a GP framework where the source code is clear and which you can easily extend or build on without starting completely afresh. GPFlow and GPyTorch are not this.

DanhPhan · April 16, 2022, 11:57pm

Thank you @theo and @bwengals for the suggestion.

After reading more on the topic, I agree that implementing the Hadamard kernel is a good option, which can be extended later for other applications.

I have updated my proposal at this link, mostly the Hadamard and Schedule sections. I have also added my recent PRs.

I have also submitted my proposal to GSoC at this link. If you think anything in the proposal needs to be changed, please let me know @fonnesbeck @bwengals.

Thank you all

bwengals · April 18, 2022, 7:56pm

Thanks for submitting!!
