Introducing Gumbi: the Gaussian Process Model Building Interface

Hello PyMC community!

I wanted to share a project I’ve been working on: Gumbi, the Gaussian Process Model Building Interface. Inspired by what Bambi does for generalized linear models, the goal of Gumbi is to make it easy and intuitive to quickly prototype Gaussian Process models. Designed specifically for tabular data (pandas DataFrames), Gumbi takes care of transforming, standardizing, and reshaping data before feeding it into an auto-built PyMC model, and then undoes all of that wrangling when it comes time to make predictions. Gumbi also includes visualization tools for easy plotting of predictions. For now there is only a PyMC backend, but I may add a GPflow backend as well.

As shown in the quickstart, Gumbi reduces building, fitting, predicting, and visualizing a model down to a few simple commands.

Read in some data and store it as a Gumbi DataSet:

import gumbi as gmb
import seaborn as sns
cars = sns.load_dataset('mpg').dropna()
ds = gmb.DataSet(cars, outputs=['mpg', 'acceleration'], log_vars=['mpg', 'acceleration', 'weight', 'horsepower', 'displacement'])

Create a Gumbi GP object and fit a model that predicts mpg from horsepower:

gp = gmb.GP(ds)
gp.fit(outputs=['mpg'], continuous_dims=['horsepower']);

Make predictions and plot!

X = gp.prepare_grid()
y = gp.predict_grid()
gmb.ParrayPlotter(X, y).plot()
sns.scatterplot(data=cars, x='horsepower', y='mpg', color=sns.cubehelix_palette()[-1], alpha=0.5);

Gumbi can handle multiple continuous input dimensions, multiple categorical input dimensions (correlated via a Linear Model of Coregionalization), and multiple outputs (also via LMC), along with Normal, Log-Normal, and Logit-Normal variables. Right now, only Marginal and MarginalSparse (DTC) implementations are supported, but I hope to add Latent variable support to allow for classification, heteroskedasticity, and other likelihood structures.
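For instance, fitting both outputs from the DataSet above at once is just a matter of listing them. Here’s a minimal sketch building on the quickstart (the second continuous dimension is my own choice from the mpg dataset):

gp = gmb.GP(ds)
# Joint GP over both outputs, correlated via the LMC, with two continuous inputs
gp.fit(outputs=['mpg', 'acceleration'], continuous_dims=['horsepower', 'weight']);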

There’s still a lot of room for improvement, but the basic structure and functionality are in place. It’s available on PyPI, and I’ll try to put it on conda-forge soon. Take a look at the docs, which include thorough API documentation and example notebooks. I plan to add more explanations of some of the underlying data structures, so please let me know if anything is unclear!
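If you want to try it out, installation is a one-liner:

pip install gumbi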

I’m still developing the package, and contributions are welcome! This is my first open-source project, so bear with me while I figure it out as I go. I’m very open to suggestions on everything from API structure to inner workings and git guidelines. There are many features I still plan to implement, some of them obvious, so please open issues requesting specific features as this will help me prioritize.

And finally, let me know what you make with it! I’m excited to see how it does out “in the wild”.

Well done!

@BioGoertz this is a fabulous addition to the PyMC3 interfaces. I use Bambi quite frequently since it mimics the interface of brms, a front-end for RStan that I use in R.

I will use your package and report back on its use here and on GitHub. Thank you for creating and sharing this wonderful tool that makes GP model building easier!

@BioGoertz, thanks for a great package! I look forward to using it. Would it be possible to add an example for folks who are coming from gradient-boosted trees or similar models and want to use Gumbi instead? I can help provide a gradient-boosted-tree-based example, but I’m not sure how best to represent it in Gumbi.

Hi @BioGoertz, I have installed and tested Gumbi on both Windows 10 and WSL2 under an Anaconda setup. Please note that my numpy version was 1.21.5; I tried but was unable to install a numpy version satisfying the pinned requirement (>=1.15, <=1.19.3).

I created an environment with numpy 1.19.3, but when I install the other packages, numpy gets upgraded to 1.21.5.

Here are the commands I ran and the errors I got (any resolution would be very helpful):

import gumbi as gmb
import seaborn as sns
cars = sns.load_dataset('mpg').dropna()
ds = gmb.DataSet(cars, outputs=['mpg', 'acceleration'], log_vars=['mpg', 'acceleration', 'weight', 'horsepower', 'displacement'])

C:\Users\sreedatta\Anaconda3\envs\gumbi_env\lib\site-packages\theano\tensor\elemwise.py:826: RuntimeWarning: invalid value encountered in log
  variables = ufunc(*ufunc_args, **ufunc_kwargs)


X = gp.prepare_grid()
y = gp.predict_grid()
gmb.ParrayPlotter(X, y).plot()
sns.scatterplot(data=cars, x='horsepower', y='mpg', color=sns.cubehelix_palette)


TypeError                                 Traceback (most recent call last)
Input In [5], in <module>
      2 y = gp.predict_grid()
      3 gmb.ParrayPlotter(X, y).plot()
----> 4 sns.scatterplot(data=cars, x='horsepower', y='mpg', color=sns.cubehelix_palette)

File ~\Anaconda3\envs\gumbi_env\lib\site-packages\seaborn\_decorators.py:46, in _deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     36 warnings.warn(
     37     "Pass the following variable{} as {}keyword arg{}: {}. "
     38     "From version 0.12, the only valid positional argument "
    (...)
     43     FutureWarning
     44 )
     45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)

File ~\Anaconda3\envs\gumbi_env\lib\site-packages\seaborn\relational.py:827, in scatterplot(x, y, hue, style, size, data, palette, hue_order, hue_norm, sizes, size_order, size_norm, markers, style_order, x_bins, y_bins, units, estimator, ci, n_boot, alpha, x_jitter, y_jitter, legend, ax, **kwargs)
    823     return ax
    825 p._attach(ax)
---> 827 p.plot(ax, kwargs)
    829 return ax

File ~\Anaconda3\envs\gumbi_env\lib\site-packages\seaborn\relational.py:608, in _ScatterPlotter.plot(self, ax, kws)
    603 scout_size = max(
    604     np.atleast_1d(kws.get("s", [])).shape[0],
    605     np.atleast_1d(kws.get("c", [])).shape[0],
    606 )
    607 scout_x = scout_y = np.full(scout_size, np.nan)
---> 608 scout = ax.scatter(scout_x, scout_y, **kws)
    609 s = kws.pop("s", scout.get_sizes())
    610 c = kws.pop("c", scout.get_facecolors())

File ~\Anaconda3\envs\gumbi_env\lib\site-packages\matplotlib\__init__.py:1412, in _preprocess_data.<locals>.inner(ax, data, *args, **kwargs)
   1409 @functools.wraps(func)
   1410 def inner(ax, *args, data=None, **kwargs):
   1411     if data is None:
---> 1412         return func(ax, *map(sanitize_sequence, args), **kwargs)
   1414     bound = new_sig.bind(ax, *args, **kwargs)
   1415     auto_label = (bound.arguments.get(label_namer)
   1416                   or bound.kwargs.get(label_namer))

File ~\Anaconda3\envs\gumbi_env\lib\site-packages\matplotlib\axes\_axes.py:4387, in Axes.scatter(self, x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, edgecolors, plotnonfinite, **kwargs)
   4384 if edgecolors is None:
   4385     orig_edgecolor = kwargs.get('edgecolor', None)
   4386 c, colors, edgecolors = \
---> 4387     self._parse_scatter_color_args(
   4388         c, edgecolors, kwargs, x.size,
   4389         get_next_color_func=self._get_patches_for_fill.get_next_color)
   4391 if plotnonfinite and colors is None:
   4392     c = np.ma.masked_invalid(c)

File ~\Anaconda3\envs\gumbi_env\lib\site-packages\matplotlib\axes\_axes.py:4160, in Axes._parse_scatter_color_args(c, edgecolors, kwargs, xsize, get_next_color_func)
   4158 if kwcolor is not None:
   4159     try:
---> 4160         mcolors.to_rgba_array(kwcolor)
   4161     except ValueError as err:
   4162         raise ValueError(
   4163             "'color' kwarg must be a color or sequence of color "
   4164             "specs. For a sequence of values to be color-mapped, use "
   4165             "the 'c' argument instead.") from err

File ~\Anaconda3\envs\gumbi_env\lib\site-packages\matplotlib\colors.py:363, in to_rgba_array(c, alpha)
    358 if isinstance(c, str):
    359     raise ValueError("Using a string of single character colors as "
    360                      "a color sequence is not supported. The colors can "
    361                      "be passed as an explicit list instead.")
---> 363 if len(c) == 0:
    364     return np.zeros((0, 4), float)
    366 # Quick path if the whole sequence can be directly converted to a numpy
    367 # array in one shot.

TypeError: object of type 'function' has no len()

Hi sree_datta, thanks for giving it a shot! That error is just down to the color argument to seaborn’s scatterplot. You passed color=sns.cubehelix_palette, which is the palette function itself, but you need to provide a single color (in the example I specify color=sns.cubehelix_palette()[-1], which gives a dark purple, but you’re of course free to use any color seaborn accepts).
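In other words, call the palette and index into it to get a single concrete color; this is exactly the call from the quickstart above:

# cubehelix_palette() returns a list of colors; take the last (darkest) one
sns.scatterplot(data=cars, x='horsepower', y='mpg', color=sns.cubehelix_palette()[-1], alpha=0.5);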

The numpy issue is unrelated and should only cause a problem if you need the MultivariateUncertainParameterArray. Newer versions of numpy were giving really bizarre low-level linear algebra errors when calling mvup.dist and mvup.cov. In fact, if you could try running the Multi-output regression section (specifically the multivariate sampling bit) of the Cars Dataset notebook with your setup and let me know whether you get weird errors, that would help a lot!

Thanks for your interest! I’m really not at all familiar with gradient-boosted trees; could you provide a rough example of your workflow? In the meantime, take a look at the Cars Dataset notebook and see if that helps clarify Gumbi’s use.

I’m guessing you’ll need to be a bit more careful with your features using a GP compared to a GBT. In Gumbi, you specify which of your variables need to be log- or logit-transformed when you create a gmb.DataSet from a pd.DataFrame. Then, when you go to build/fit the GP, you declare which inputs you’re actually interested in by specifying which are to be treated as continuous variables (continuous_dims) and which as categorical (categorical_dims).

Continuous variables are treated as you might expect, while categorical variables are used to define separate but correlated Gaussian Processes. If you have one categorical variable, each value in that column defines a separate GP; if you have multiple, each unique combination of categorical values gets its own. This is done through what is known as the “Linear Model of Coregionalization”, or LMC. As the name suggests, the GP learns a linear correlation between the classes, but that correlation can be positive or negative, and it will happily learn there’s no correlation if the data indicate that.
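As a minimal sketch of what that looks like with the cars data from the quickstart (using the mpg dataset’s origin column as my own choice of categorical variable):

gp = gmb.GP(ds)
# Each value of 'origin' gets its own GP; the LMC learns their correlations
gp.fit(outputs=['mpg'], continuous_dims=['horsepower'], categorical_dims=['origin']);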

Finally, multi-output regression is also achieved through an LMC implementation.

Hope that helps!

Thanks @BioGoertz for the quick feedback. I will make the changes pertinent to seaborn and also test the multi-output regression example, then share the results back here.

Great! If you could log any issues/errors like this on the GitHub page, that will make them easier to keep track of, and we can keep this thread for more high-level, conceptual discussion.

@BioGoertz I have opened an issue ticket here: Assertion Error with Multi Output Regression on WSL2.

@BioGoertz I was unable to install gumbi-0.1.8 using pip install gumbi. I have posted the error on GitHub.

(this is now fixed - it installs now!)

Hi there, cool new package! I’ve had fun with Bambi and thought I’d try this over a vanilla solution. To be honest, I’m a bit confused though: I built a simple model directly with pymc3 and then tried Gumbi, but the predictions from predict are way off compared to MarginalApprox (MarginalSparse in v3). What exactly does prepare_grid do when you have multiple continuous parameters? I.e., why does the ParameterArray input of points have to be 1-D?
Sorry for the thousand probably silly newbie questions.

Hi @Oliver, thanks for giving it a shot! Sorry for the delay in getting back to you. It’s interesting that the predictions are very different; could you provide a minimal example and open an issue on GitHub?

To your question about prepare_grid: it just builds a Cartesian grid across all dimensions, defaulting to 100 points spanning -2.5σ to +2.5σ, where σ is the standard deviation of the (transformed) dimension. I’m not sure what you mean by “the ParameterArray input of points has to be 1-D”; maybe you’re referring to the at keyword argument? That’s intended to indicate a specific “slice” of your multidimensional space. You can see its usage in this example from the documentation:

...

# Fit mpg on three continuous inputs, adding linear kernel components for each
gp.fit(outputs=['mpg'], continuous_dims=['horsepower', 'weight', 'cylinders'], linear_dims=['horsepower', 'weight', 'cylinders']);

# One subplot per cylinder count: predict the horsepower-weight surface at that slice
for ax, cyl in zip(axs, cylinders):
    XY = gp.prepare_grid(at=gp.parray(cylinders=cyl))  # grid over the remaining dims, cylinders held fixed
    X = XY['horsepower']
    Y = XY['weight']
    z = gp.predict_grid(with_noise=False)

...

There, a GP is fit on the “cars” dataset. We’re interested in the mpg surface as a function of horsepower and weight at each value of cylinders. Now, the cylinders dimension is technically discrete, not continuous, but I haven’t implemented a discrete kernel yet, so it’s treated as continuous. For plotting purposes, we use the at keyword argument to specify the particular “slice” through cylinders-horsepower-weight space along which to make predictions.

Hope that helps; let me know if anything’s unclear! I’m still developing and maintaining the package as I have time, so suggestions/issues/questions are very welcome!