Hi everyone, I am Danh Phan, a PhD candidate at Monash University, Australia. My research focuses on machine learning (Bayesian methods, choice models, tree-based models, and NNs) for intelligent transport systems. I have recently worked on multi-output Gaussian processes (GPs) for generating people’s travel activity times.
I have used PyMC for a while and have submitted a couple of PRs to pymc and aesara, thanks to the great support from @pymc_devs, especially @ricardoV94, @twiecki, and @brandonwillard.
I would love to work on the Multi-output Gaussian Processes in PyMC, with the potential mentors @bwengals and @fonnesbeck.
This project aims to extend the `pymc.gp` module to support multi-output GPs. The tentative main tasks of this project may include:
- Implement a `MultitaskMean` class that works well with the current `Mean` classes in
- Upgrade the coregionalization kernel (I see that we already have a `Coregion` kernel in `pymc.gp.cov`, but it may need to be improved to accommodate multiple outputs) to support the intrinsic coregionalization model (ICM), semiparametric latent factor model (SLFM), and linear model of coregionalization (LMC) kernels.
- Implement an `Index` kernel, and then a `Hadamard` kernel that wraps any kernel (like `Matern32`, …) with the
- Write tests for these new classes, and test their compatibility with other `Cov` kernel functions/classes in the `pymc.gp` module. Make sure the compatibility between different
- Write documentation and examples for multi-output GPs.
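To make the ICM and Hadamard ideas in the list above concrete, here is a minimal NumPy sketch. It is not the PyMC API: the helper names `matern32`, `coregion_matrix`, and `hadamard_kernel` are hypothetical, and the parameterisation B = W Wᵀ + diag(kappa) is an assumption borrowed from the usual ICM formulation. Each observation carries a task index, and the covariance between two observations is the input kernel multiplied elementwise by the matching entry of B.

```python
import numpy as np


def matern32(x1, x2, ls=1.0):
    """Matern 3/2 kernel on 1-D inputs (hypothetical helper, not pymc.gp.cov.Matern32)."""
    r = np.abs(x1[:, None] - x2[None, :]) / ls
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)


def coregion_matrix(W, kappa):
    """Between-output covariance B = W W^T + diag(kappa); PSD whenever kappa >= 0."""
    return W @ W.T + np.diag(kappa)


def hadamard_kernel(x1, task1, x2, task2, B, ls=1.0):
    """K[(x, i), (x', j)] = k(x, x') * B[i, j], an elementwise (Hadamard) product."""
    # B[task1[:, None], task2[None, :]] looks up B[i, j] for every pair of observations.
    return matern32(x1, x2, ls) * B[task1[:, None], task2[None, :]]


# Illustrative values only (assumed, not taken from any data set).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 1))          # 2 outputs, rank-1 coregionalization
kappa = np.array([0.1, 0.1])
B = coregion_matrix(W, kappa)

# Outputs may be observed at different inputs: 3 points for task 0, 2 for task 1.
x = np.array([0.0, 0.5, 1.0, 0.2, 0.9])
task = np.array([0, 0, 0, 1, 1])
K = hadamard_kernel(x, task, x, task, B)   # (5, 5) joint covariance
```

Because the task index is just another input column, the outputs do not need to share the same input locations, which is the main practical appeal of this construction. An LMC kernel would be a sum of several such terms, each with its own input kernel and its own B.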
I plan to translate some code from existing GP libraries, especially GPy and GPyTorch for the `Coregion` kernel, and GPyTorch for `Hadamard`. I have worked with PyTorch for a couple of years, so I’m quite familiar with the GPyTorch code. I will write a detailed proposal and share it in this thread soon.
Thank you all!
Thanks for this! We look forward to seeing the proposal.
Hi @fonnesbeck and @bwengals,
I have made a draft of the proposal at this link. Please give me some suggestions at your convenience.
Also, regarding the `Hadamard` model, I am looking for a real example data set for this model, but have not been able to find one. The example in GPyTorch only uses synthetic data, and I am not sure which paper this model refers to.
When reading some papers, I saw that the Convolutional Process (CONV) model is quite popular and has recently been implemented in the mogptk library. So, should we implement CONV instead of the Hadamard model? Please advise.
I am also open to any ideas from all members of our PyMC community. Please let me know your opinions on this project.
I’d be interested in following the progress of this project and aiding in discussions as I will be a keen end-user.
In terms of datasets, it would be good to see some spatiotemporal models with multivariate outcomes. Some examples are the multi-infectious disease time series from Project Tycho used in Flaxman 2015 and the continuous-space multi-crime dataset used in Aglietti 2019. There are plenty of time-series examples from finance too.
I’d focus on the most general and flexible solvers, which can be easily incorporated into a wider pymc model, rather than super-specific solutions. Even showing how to write a multi-output GP with a matrix mean function (a new column for each output), or as a GP stacking each of the outcomes with a Kronecker product representing the correlation between each of the outputs, would be extremely valuable.
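The stacked formulation mentioned above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: all outputs are observed at the same inputs, the helper `exp_quad` and every numeric value are made up for the example, and nothing here is PyMC API. The mean is an n × p matrix with one column per output, and the joint covariance of the stacked outputs is the Kronecker product of the between-output covariance with the input covariance.

```python
import numpy as np


def exp_quad(x1, x2, ls=1.0):
    """Squared-exponential kernel on 1-D inputs (hypothetical helper)."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)


n, p = 20, 3                                   # n inputs, p outputs
x = np.linspace(0.0, 1.0, n)
Kx = exp_quad(x, x, ls=0.3)                    # (n, n) covariance over inputs

# Between-output covariance built from a Cholesky-style factor (assumed values).
L_out = np.array([[1.0, 0.0, 0.0],
                  [0.8, 0.6, 0.0],
                  [-0.5, 0.3, 0.4]])
B = L_out @ L_out.T                            # (p, p)

# Stack the outputs one after another: block (i, j) of K is B[i, j] * Kx.
K = np.kron(B, Kx)                             # (p * n, p * n)

# Matrix mean function: one column per output, stacked in the same order as K.
mean = np.column_stack([np.zeros(n), 2.0 * x, np.ones(n)])   # (n, p)
mu = mean.T.reshape(-1)                        # output 0 first, then output 1, ...

# Draw one joint sample from the prior via a jittered Cholesky factor.
rng = np.random.default_rng(1)
chol = np.linalg.cholesky(K + 1e-6 * np.eye(n * p))
f = mu + chol @ rng.standard_normal(n * p)
F = f.reshape(p, n).T                          # back to (n, p), one column per output
```

In practice one would avoid forming the full Kronecker product and exploit its structure instead, since the dense matrix grows as (n·p)², but the dense version makes the block layout easy to check.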
Thank you @theo for the suggestion.
Sure, the multi-output GP Coregion Regression with Kronecker product will be added and/or improved in PyMC. This is the main task in phase 1 of the project’s proposal.
Also thanks for the datasets, I am reading and collecting more data and examples/applications of multi-output GPs, and this will help.
Yup, agree with @theo. PyMC’s GPs won’t ever be better or more exhaustive than GPy/GPyTorch/GPflow, and I think that’s not the goal, but we really want to implement the stuff that fits well with MCMC inference, that has or enables lots of practical applications, and that makes sense embedded in larger probabilistic models. So at least for me, it’s all about what would get used the most as opposed to what’s in the latest paper. Thinking about what has the most use cases is the hard part, and is really a judgement call.
So, for your specific question about CONV or CONV vs Hadamard, I’m not sure. I’m also not familiar with CONV (not to disagree with how popular it may be!), will def check it out.
Have a look at the tinygp library for a good example of a GP library that is extensible and usable within PPL models, with really clear code. The main problem I find with GPyTorch and GPflow is that they are extremely rigid and impossible to hack/add as a component to a larger model. The tinygp library should give you a few ideas of how you might develop pymc’s GP functionality. The implementation of quasiseparable kernels is a simple extension for speeding up 1D GPs (which are common).
Hi @bwengals, thank you for the suggestion.
Yes, I would like to implement the method that suits PyMC’s MCMC inference, and that has or enables lots of practical applications for PyMC’s users. The multi-output Coregion Regression should be a popular and useful one.
I am also not sure how popular the Convolutional Process (CONV) model is in practice. I just wanted to raise it here since it is mentioned in Section 5.3 of this review paper, and also in these lecture slides (p33-53).
For the Hadamard model, there is a notebook example in GPyTorch, but no reference to a related paper. I found one paper here, but I am not sure if it is the right paper for the Hadamard model. Please advise!
Thanks @theo for the info on the `tinygp` package. I see it uses `JAX` as a backend, and I will definitely look into the details of its source code.
Also, could you please elaborate more and give an example on this point:
The main problem I find with GPyTorch and GPFlow is that they are extremely rigid and impossible to hack/add as a component to a larger model.
Thank you @theo and @bwengals for the suggestion.
After reading more on the topic, I agree that implementing the `Hadamard` kernel is a good option, which can be extended later for other applications.
I have updated my proposal at this link, mostly in the `Hadamard` and the Schedule sections. Recent PRs have also been added.
I have also submitted my proposal to GSoC at this link. If you think anything in the proposal needs to be changed, please let me know @fonnesbeck @bwengals.
Thank you all