[PyMCon Web Series 02] An introduction to multi-output Gaussian processes using PyMC (Feb 21, 2023) (Danh Phan)

An Introduction to Multi-Output Gaussian Processes using PyMC

Speaker: Danh Phan

Event type: Live webinar
Date: Feb 21st 2023 (subscribe here for email updates)
Time: 22:00 UTC
Register for the event on Meetup to get the Zoom link
Talk Code Repository: On GitHub
Web App: Interest rate prediction for the US, AU, and UK

NOTE: The event is recorded. Subscribe to the PyMC YouTube for notifications.

Sponsor

We thank our sponsors for supporting PyMC and the PyMCon Web Series. If you would like to sponsor us, contact us for more information.

Mistplay is the #1 Loyalty Program for mobile gamers, with over 20 million users worldwide. Millions of gamers use our platform to discover games, connect with friends, and earn awesome rewards. We are a fast-growing, profitable company, recently ranked as the 3rd fastest-growing technology company in Canada. Our passion for innovation drives our growth across the industry through the development of new apps, powerful ad tech tools, and the recent launch of a publishing division for mobile games.

Mistplay is hiring for a Senior Data Scientist (Remote or Montreal, QC).

Content

Video: Interview with Danh Phan (7 minutes)

Video: Intro to Multi-Output Gaussian Processes Using PyMC

Welcome to the second event of the PyMCon Web Series! As part of this series, most events will have an async component and a live talk.

In this case, Danh, as part of the async component, prepared a full repository for the community to engage with before the talk. It includes multiple Colab notebooks and a PDF slide deck.

Take a look before the talk, post your questions below, and be prepared for the discussion.

Abstract of the talk

Multi-output Gaussian processes have recently gained strong attention from researchers and have become an active research topic in multi-task machine learning. Their advantage is the capacity to simultaneously learn and infer many outputs that share a common source of uncertainty in the inputs.

This talk shows audiences how to build multi-output Gaussian processes in PyMC. It first introduces the concepts of Gaussian processes (GPs) and multi-output GPs and how they can address real problems in several domains. It then shows how to implement multi-output GP models, such as the intrinsic coregionalization model (ICM) and the linear model of coregionalization (LCM), in Python using PyMC with real-world datasets.

The talk aims to get users up and running with GPs quickly, especially multi-output GPs, using PyMC. Several examples with time-series datasets illustrate different GP features. The presentation will help users leverage GPs to analyze their own data effectively.
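To give a flavor of the ICM construction the talk covers, here is a minimal PyMC sketch (v5-style API). The toy data, priors, and variable names are illustrative, not the speaker’s actual code; the real notebooks are in the talk repository linked above. The trick is to append the output index as an extra input column and multiply a kernel over the inputs by a `Coregion` kernel over that index.

```python
import numpy as np
import pymc as pm

# Toy data: two correlated time series driven by a shared latent function.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 30)
f = np.sin(x)
y0 = f + 0.1 * rng.standard_normal(x.size)
y1 = 0.8 * f + 0.1 * rng.standard_normal(x.size)

n_outputs = 2
# ICM trick: append the output index (0..n_outputs-1) as an extra column.
X = np.column_stack([
    np.tile(x, n_outputs),
    np.repeat(np.arange(n_outputs), x.size),
])
y = np.concatenate([y0, y1])

with pm.Model():
    # Kernel over the actual input (column 0 only).
    ell = pm.Gamma("ell", alpha=2.0, beta=1.0)
    eta = pm.HalfNormal("eta", sigma=1.0)
    k_input = eta**2 * pm.gp.cov.ExpQuad(2, ls=ell, active_dims=[0])

    # Coregionalization kernel over the output index (column 1):
    # B = W @ W.T + diag(kappa) encodes the between-output covariance.
    W = pm.Normal("W", mu=0.0, sigma=1.0, shape=(n_outputs, 1))
    kappa = pm.Gamma("kappa", alpha=2.0, beta=1.0, shape=n_outputs)
    k_task = pm.gp.cov.Coregion(2, W=W, kappa=kappa, active_dims=[1])

    gp = pm.gp.Marginal(cov_func=k_input * k_task)
    sigma = pm.HalfNormal("sigma", sigma=0.5)
    gp.marginal_likelihood("y", X=X, y=y, sigma=sigma)

    idata = pm.sample(target_accept=0.9)
```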


Here’s a link to Danh’s interview ahead of the event. Looking forward to seeing you all there!

A question about how important the joint normality assumption is:

I have a data set where I feel good about the marginal distributions of y_1, y_2, y_3 being Gaussian (I can model them separately just fine), but I’m not sure about them being jointly normal, similar to the teardrop shape in multivariate analysis - Is it possible to have a pair of Gaussian random variables for which the joint distribution is not Gaussian? - Cross Validated, due to some heteroskedasticity or something.

[Image: examples of bivariate distributions with standard normal marginals]

Can choosing a good kernel for the cross-covariance deal with something like this, or would I need some other strategy in general for a combined model?

@nfultz It’s not GPs but this is something copulas are built to handle

edit: just clicked your link to find that it’s explicitly from copulas. Disregard.
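For anyone reading along later, the counterexample behind that Cross Validated link is easy to verify numerically. A minimal sketch (plain NumPy, nothing GP-specific): reflect the tails of a standard normal; each marginal stays exactly N(0, 1) by symmetry, but the joint is clearly not Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 1.54  # any positive threshold works for the counterexample
x = rng.standard_normal(200_000)
# Reflect the tails: y is still exactly N(0, 1) marginally,
# but (x, y) is not jointly Gaussian.
y = np.where(np.abs(x) <= c, x, -x)

print(y.mean(), y.std())      # ~0 and ~1: the marginal looks standard normal
# If (x, y) were jointly Gaussian, x + y would be Gaussian too; instead
# x + y is exactly 0 whenever |x| > c, i.e. a point mass at zero.
print(np.mean(x + y == 0.0))  # clearly positive probability
```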


I’m curious, for the baseball example, what benefits from a business standpoint come from modeling this at all?

Looking at it, it seems like a rolling average would probably do fine to smooth the data. And looking at the right side of the plots, the GPs are not going to extrapolate well. You can fill in between games, but I’m not sure that’s useful. So I’m curious what value the modeling process here actually brings.

For the baseball example, is there a benefit to using multi-output vs. single-output?

Was thinking there was a way to use the Xs to get a fan or teardrop shape on Ys. Not sure if it would work.

Thank you @fonnesbeck.
So, is it OK to infer that multi-output GPs have a similar effect on interpolation as multilevel models have on inference?
i.e., they help in areas where some groups (or outputs, in the GP case) have less data.

As presented, perhaps not much, but it can be helpful when you have players with varying amounts of data. If we know there are correlations among the players, then we can get better inferences for players that have fewer games.
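To make that concrete: in the ICM sketch earlier in the thread, the information sharing lives in the coregionalization matrix B = W Wᵀ + diag(κ). A quick, illustrative way to inspect it after sampling (continuing from that toy model, so `idata`, `W`, and `kappa` are the names used there):

```python
import numpy as np

# Posterior-mean estimate of the between-output covariance B.
W_hat = idata.posterior["W"].mean(("chain", "draw")).to_numpy()
kappa_hat = idata.posterior["kappa"].mean(("chain", "draw")).to_numpy()
B = W_hat @ W_hat.T + np.diag(kappa_hat)

# Convert to correlations: large off-diagonals mean outputs with little
# data borrow strength from the better-observed ones.
corr = B / np.sqrt(np.outer(np.diag(B), np.diag(B)))
print(corr)
```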

You never want to extrapolate beyond the lengthscale of the GP, that’s true. If you were interested in extrapolation, you would either have another function to use as the GP’s mean function, or have an additive kernel that includes a component with a longer lengthscale.
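A rough sketch of both options in PyMC (an additive kernel with a long-lengthscale component, plus a parametric mean function); the data, priors, and names are made up for illustration:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)[:, None]
y = 0.3 * x.ravel() + np.sin(x.ravel()) + 0.1 * rng.standard_normal(50)

with pm.Model():
    # Short lengthscale captures local wiggles; the long-lengthscale
    # component keeps predictions from reverting to the mean right away.
    ls_short = pm.Gamma("ls_short", alpha=2.0, beta=2.0)
    ls_long = pm.Gamma("ls_long", alpha=5.0, beta=0.5)
    cov = (pm.gp.cov.Matern52(1, ls=ls_short)
           + pm.gp.cov.ExpQuad(1, ls=ls_long))

    # Alternative (or complement): put the trend in the mean function,
    # so extrapolation follows the trend rather than decaying to zero.
    b0 = pm.Normal("b0", mu=0.0, sigma=1.0)
    b1 = pm.Normal("b1", mu=0.0, sigma=1.0, shape=1)
    mean = pm.gp.mean.Linear(coeffs=b1, intercept=b0)

    gp = pm.gp.Marginal(mean_func=mean, cov_func=cov)
    sigma = pm.HalfNormal("sigma", sigma=0.5)
    gp.marginal_likelihood("y", X=x, y=y, sigma=sigma)
```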

I think my solution is lurking in this slide somewhere:

[slide image]

Can we modify \sigma^2 to vary with the location of x_i rather than being fixed? And maybe allow correlations between y_{1j} and y_{2j} instead of the zeros on the off-diagonal blocks?
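A sketch of the first idea, using pm.gp.Latent with an explicit likelihood so the noise scale can depend on x_i (the log-linear noise model and all priors here are just illustrative choices, not from the slides):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 60)[:, None]
# Synthetic data whose noise grows with x.
y = np.sin(x.ravel()) + (0.05 + 0.04 * x.ravel()) * rng.standard_normal(60)

with pm.Model():
    ell = pm.Gamma("ell", alpha=2.0, beta=1.0)
    gp = pm.gp.Latent(cov_func=pm.gp.cov.ExpQuad(1, ls=ell))
    f = gp.prior("f", X=x)

    # Noise sd as a log-linear function of x: no single fixed sigma^2.
    a = pm.Normal("a", mu=-2.0, sigma=1.0)
    b = pm.Normal("b", mu=0.0, sigma=1.0)
    sigma_x = pm.math.exp(a + b * x.ravel())

    pm.Normal("y", mu=f, sigma=sigma_x, observed=y)
```

For the second question, one option is to replace the independent Normal likelihood with a pm.MvNormal over each (y_{1j}, y_{2j}) pair, giving the 2x2 noise covariance nonzero off-diagonals (e.g., via pm.LKJCholeskyCov).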