Multi-output Gaussian processes

Hi,
is there a (preferably simple) way to implement multi-output Gaussian processes in pymc? As for "additional" features, I think I need "non-centered cross-covariance" (i.e. different outputs related with a small time delay) and "negative cross-covariance" (i.e. different outputs related in opposite ways). Are such models common at all?

By multi-output GP, do you mean high-dimensional observations or repeated observations of the same GP?
For the high-dimensional problem you just need to specify the dimension in your Mean and Cov functions.
For repeated observations of the same GP, @jordan-melendez gives quite a good solution in another post: Multiple (uncertain) function observations of the same Gaussian process

I think the additional features you need are also possible even with the current API; you just need to carefully structure and parameterize your Cov function.

I meant multidimensional observations - just a couple of dimensions, actually. And I don't quite see what the repeated observations discussed in that post would give me, since as far as I can tell they only share the parameters of the covariance function.

And how do I specify dimensions in the mean and cov functions? Do I have to create custom ones? As I understand it, those described in the docs only support multidimensional inputs, not outputs - can I combine them somehow?

Yes, you are right, this is actually not trivial to do as we don't have an output_dim in the cov functions. You will need to construct a multi-output kernel by multiplying two Cov functions/matrices (there is some information on this in GPy), squeeze the output into 1D, and model the whole thing using MvNormal.
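For concreteness, here is a rough sketch of that kind of construction (not an existing PyMC3 feature, just an illustration of the idea): a coregionalization-style kernel built as a Kronecker product of a small output covariance B and an input kernel, with the flattened outputs modelled by a single MvNormal. The toy data below are placeholders.

import numpy as np
import pymc3 as pm
import theano.tensor as tt
import theano.tensor.slinalg as slinalg

# toy data: two noisy, shifted sine curves observed at the same inputs
x = np.random.rand(20) * 3
y = np.vstack([np.sin(2 * x), np.sin(2 * x + 1)]) + 0.2 * np.random.randn(2, 20)

K_out = 2  # number of outputs
with pm.Model() as mo_model:
    ls = pm.Lognormal('ls', 0, 0.5)
    Kx = pm.gp.cov.ExpQuad(1, ls)(x[:, None])       # n x n input covariance

    # low-rank-plus-diagonal output covariance B = W W^T + diag(kappa)
    W = pm.Normal('W', 0, 1, shape=(K_out, 1))
    kappa = pm.HalfNormal('kappa', 1, shape=K_out)
    B = tt.dot(W, W.T) + tt.diag(kappa)

    # joint covariance over all outputs, plus observation noise
    n_tot = K_out * len(x)
    K_full = slinalg.kron(B, Kx) + pm.HalfCauchy('sigma', 1)**2 * tt.eye(n_tot)
    y_ = pm.MvNormal('y', mu=tt.zeros(n_tot), cov=K_full, observed=y.flatten())

Negative cross-covariance then corresponds to negative off-diagonal entries of B; a time shift between outputs would need extra handling on the input side.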

Maybe @bwengals knows a better way to do it?

Yep, @junpenglao is right. Also check out this thread.

I tried a very simple approach for two-output processes: add a dummy input dimension which equals 0 for values of the first output and 1 for the second. To allow for some time shift between the outputs I implemented the covariance function this way (for two outputs only):

class ExpQuad(pm.gp.cov.Stationary):
    def __init__(self, shift, **kwargs):
        super(ExpQuad, self).__init__(**kwargs)
        self.shift = shift
    
    def full(self, X, Xs=None):
        X = X + [0, self.shift] * tt.eq(X[:, 1], 1)[:, None]
        if Xs is not None:
            Xs = Xs + [0, self.shift] * tt.eq(Xs[:, 1], 1)[:, None]
        X, Xs = self._slice(X, Xs)
        return tt.exp(-0.5 * self.square_dist(X, Xs))

It takes a shift parameter which represents the shift of one output relative to the other. Using this function I created a model:

with pm.Model() as model:
    c = pm.Cauchy('c', 0, 0.1)
    eta = pm.HalfCauchy("eta", 0.1)
    ls = pm.Lognormal("ls", 0, 0.5, shape=2)
    shift = pm.Normal('shift', 0, 0.5)
    cov = ExpQuad(input_dim=2, ls=ls, shift=shift)

    gp = pm.gp.Latent(mean_func=pm.gp.mean.Constant(c),
                      cov_func=eta**2 * cov)
    f = gp.prior('f', X=X)

    σ = pm.HalfCauchy('σ', 0.001)
    y_ = pm.Cauchy('y', f, σ, observed=y_train.flatten())

    trace = pm.sample(njobs=2)

    y_pred = gp.conditional('y_pred', Xnew=X_pred)
    samples = pm.sample_ppc(trace, vars=[y_pred], samples=1000)

with training data generated this way:

x_train = np.random.rand(20) * 3
y_train = np.vstack([np.sin(x_train * 2), np.sin(x_train * 2 + 1)])
y_train += np.random.randn(*y_train.shape) * 0.2

# stack the inputs for both outputs and append the indicator column (0 or 1)
x1 = tt.stack([x_train, x_train])
x2 = tt.repeat(tt.arange(2)[:, None], len(x_train), axis=1)
X = tt.stack([x1.flatten(), x2.flatten()], axis=1).eval()

x_pred = np.linspace(x_train.min() - 0.2, x_train.max() + 0.2, num=200)
X_pred = np.stack([np.tile(x_pred, 2), np.repeat(np.arange(2), len(x_pred))], 1)

So the true function for the first output is y1 = sin(2x) and for the second y2 = sin(2x + 1).

The ppc results look good:
[image: posterior predictive plot for both outputs]

But the interval for the shift is very wide:
[image: posterior distribution of shift]

From the fits in the PPC plot it is clearly visible that the shift between the curves is close to 0.5 (the true value), but the posterior doesn't capture it.

So, does this model look good to you at all (also considering the extension to 3 or more outputs)? And is there something I can change to estimate the shift better?


I think the shift is currently unidentifiable in your case: if you model the two curves separately (without the shift), you can still get the same posterior predictive curves as in your figure. The reason is that in a GP the posterior prediction is conditioned on the training data, so the predictions will most likely stay bounded around the observations/training data.

Oh, there was a bug in my covariance function - replace [0, self.shift] with [self.shift, 0] (the shift should be added to the time column, not the indicator). After this fix I get this posterior:
[image: posterior of shift after the fix]
As you can see, the shift is now more constrained. I understand that further improvements are most probably related to the model itself, not its implementation, so I guess they are out of scope here?
Anyway, what do you think about extending this model to more outputs? As I understand it, I need a dummy input "indicator" dimension that labels the output of each point and a covariance function that applies the pairwise shifts correctly. Or maybe there is a simpler/better way?
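For example, one way this could look (just a sketch of my current idea, not tested): keep a single indicator column X[:, 1] with values 0..K-1 and give every output its own shift, with the first shift fixed to 0 as the reference output.

import pymc3 as pm
import theano.tensor as tt

class ShiftedExpQuad(pm.gp.cov.Stationary):
    """ExpQuad over [time, indicator] inputs with a per-output shift of the time column."""

    def __init__(self, shifts, **kwargs):
        super(ShiftedExpQuad, self).__init__(**kwargs)
        # shifts: length-K tensor, e.g. tt.concatenate([[0.], shift_rv]),
        # so that output 0 stays unshifted
        self.shifts = tt.as_tensor_variable(shifts)

    def _apply_shift(self, X):
        X = tt.as_tensor_variable(X)
        idx = tt.cast(X[:, 1], 'int32')   # indicator column holds the output index
        # add each row's per-output shift to the time column only
        return tt.set_subtensor(X[:, 0], X[:, 0] + self.shifts[idx])

    def full(self, X, Xs=None):
        X = self._apply_shift(X)
        if Xs is not None:
            Xs = self._apply_shift(Xs)
        X, Xs = self._slice(X, Xs)
        return tt.exp(-0.5 * self.square_dist(X, Xs))

It could then be used like the two-output version above, e.g. cov = ShiftedExpQuad(input_dim=2, ls=ls, shifts=tt.concatenate([[0.], shift])) with shift having shape K - 1.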


Actually, I can think of another model to capture relationships between several outputs, but I don't see how to implement it in pymc. The general model looks like y_i(t) = g_i(f(h_i(t))), where for simplicity g_i and h_i have parametric forms - e.g. the linear case y_i(t) = a_i * f(t + dt_i) + b_i.

Here is what I tried:

with pm.Model() as model:
    eta = pm.HalfCauchy("eta", 0.1)
    ls = pm.Lognormal("ls", 0, 0.5)

    shift = pm.Normal('shift', 0, 0.5)
    ampc = pm.Normal('ampc', 0, 1, shape=2)

    cov = eta**2 * pm.gp.cov.ExpQuad(1, ls)
    gp = pm.gp.Latent(cov_func=cov)

    # this line obviously doesn't work
    f = gp.prior('f', X=tt.concatenate([x_train, x_train + shift])[:, None])

    y_true = tt.concatenate([f.reshape((2, -1))[0],
                             ampc[0] + ampc[1] * f.reshape((2, -1))[1]])
    σ = pm.HalfCauchy('σ', 0.001)
    y_ = pm.Cauchy('y', y_true, σ, observed=y_train.flatten())

Here we can't give a random X argument to the gp prior. Is it possible somehow along these lines?

Unfortunately, the gp object needs to know the shape that f will have, since it constructs a random variable internally. When given a numpy array for X it can infer it, but if you give a theano tensor, you need to provide the shape arg.

f = gp.prior('f', X=tt.concatenate([x_train, x_train + shift])[:, None], shape=y_train.size)

Here’s an example of this being done.

Since X can pretty much be anything, you can do some crazy stuff like having the input to a GP be another GP, or using a GP as the mean function of another GP. Really nice plots, btw.
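For concreteness, a minimal sketch of the "GP as the input to another GP" case (the data and names below are just illustrative, not from this thread):

import numpy as np
import pymc3 as pm

# toy data
x_train = np.linspace(0, 3, 30)
y_obs = np.sin(2 * x_train) + 0.2 * np.random.randn(30)

with pm.Model() as deep_model:
    ls1 = pm.Lognormal('ls1', 0, 0.5)
    ls2 = pm.Lognormal('ls2', 0, 0.5)

    # the first latent GP produces a warped "time" axis
    gp1 = pm.gp.Latent(cov_func=pm.gp.cov.ExpQuad(1, ls1))
    warp = gp1.prior('warp', X=x_train[:, None])

    # the second GP takes the first GP's output as its input;
    # X is a theano tensor here, so the shape must be given explicitly
    gp2 = pm.gp.Latent(cov_func=pm.gp.cov.ExpQuad(1, ls2))
    f = gp2.prior('f', X=warp[:, None], shape=len(x_train))

    σ = pm.HalfCauchy('σ', 1)
    y_ = pm.Normal('y', f, σ, observed=y_obs)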


Wow, really? I read most of the GP docs and didn't see this option (having something random as X) - I thought it just wasn't implemented because of some difficulties. Thanks for pointing it out!
Btw, do you know of any other example where this can be useful?

As for the plots - I just prefer the PPC shown with fixed-quantile filled areas instead of the "density" that the pymc plot function produces. With fixed quantiles I can see more clearly where exactly the uncertainties are larger and by how much. But obviously they are completely unsuitable for multimodal posteriors.
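For example, something like this (reusing samples, y_pred and x_pred from the two-output example earlier in the thread; the quantile levels are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

# PPC draws for the first output: the first half of each y_pred draw (see X_pred above)
f1 = samples['y_pred'][:, :len(x_pred)]
lo, med, hi = np.percentile(f1, [5, 50, 95], axis=0)

plt.fill_between(x_pred, lo, hi, alpha=0.3)   # 90% fixed-quantile band
plt.plot(x_pred, med)                         # posterior median
plt.show()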

Thanks again for your advice - it works wonderfully on real data! I use both the simple model mentioned above, with y1 = f(t), y2 = a + b f(t + dt2), y3 = c + d f(t + dt3), for one set of observations y1, y2, y3, and an even more complex model, y1 = f(t), y2 = g(t), y3 = a + b f(t + dt1) + c g(t + dt2), for another set of y1, y2, y3 (f and g here are GPs which share covariance function parameters). Both work really well and give sensible intervals for all parameters. I think pymc3 is currently the most general library for Gaussian processes, and it's still easy to implement such models.
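In case it's useful, here is roughly what the simpler of the two models looks like (a sketch with toy stand-in data instead of the real observations):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# toy stand-in data: three related noisy series on the same time grid
t = np.linspace(0, 3, 30)
y_obs = np.vstack([np.sin(2 * t),
                   0.5 + 0.8 * np.sin(2 * (t + 0.3)),
                   -0.2 + 1.2 * np.sin(2 * (t + 0.6))])
y_obs += 0.1 * np.random.randn(*y_obs.shape)

with pm.Model() as model3:
    ls = pm.Lognormal('ls', 0, 0.5)
    eta = pm.HalfCauchy('eta', 1)
    dt = pm.Normal('dt', 0, 0.5, shape=2)      # dt2, dt3
    coef = pm.Normal('coef', 0, 1, shape=4)    # a, b, c, d

    gp = pm.gp.Latent(cov_func=eta**2 * pm.gp.cov.ExpQuad(1, ls))
    # one latent f evaluated at t, t + dt2 and t + dt3 in a single prior call
    X = tt.concatenate([t, t + dt[0], t + dt[1]])[:, None]
    f = gp.prior('f', X=X, shape=3 * len(t)).reshape((3, -1))

    mu = tt.stack([f[0],
                   coef[0] + coef[1] * f[1],
                   coef[2] + coef[3] * f[2]])
    σ = pm.HalfCauchy('σ', 0.1)
    y_ = pm.Cauchy('y', mu, σ, observed=y_obs)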

The only downside I experience is speed - it takes on the order of 10 sec per iteration when all the time series contain 50+ points.


That's great! Just a small side note: you can try minibatch with a mean-field approximation to speed up the computation. In combination with an informative prior, the bias is usually quite small.
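A minimal sketch of the mean-field part (assuming the model from above is in scope as model):

with model:
    approx = pm.fit(n=30000, method='advi')   # mean-field ADVI
    trace = approx.sample(1000)

For the minibatch part you would additionally wrap the observed data with pm.Minibatch, though with a latent GP that needs more care.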

While trying out and comparing several models that combine GPs in different ways, I ran into a few things that seem like they could be improved - or maybe I just don't understand how to use the features correctly.

  • Regular sampling from GPs. I do it this way:
with pm.Model() as model:
    ...
    gp = pm.gp.Latent(...)
    f = gp.prior('f', X=...)
    y = pm.Cauchy('y', f, sd, observed=obs)

    trace = pm.sample()

    f_pred = gp.conditional('f_pred', Xnew=...)
    samples = pm.sample_ppc(trace, vars=[f_pred])

One of the issues in this case is that the model now contains the f_pred variable, and I can't e.g. compute the logp of the base model without this variable (which is used for prediction only).

  • More complex sampling, e.g. both f(x) and a f(x) + b.
with pm.Model() as model:
    ...
    a = ...
    b = ...
    gp = pm.gp.Latent(...)
    f = gp.prior('f', X=...)
    y = pm.Cauchy('y', f, sd, observed=obs_y)
    z = pm.Cauchy('z', a * f + b, sd, observed=obs_z)

    trace = pm.sample()

    f_pred = gp.conditional('f_pred', Xnew=...)
    samples = pm.sample_ppc(trace, vars=[f_pred])

I see two issues here. One is not being able to generate samples from z for new X values at all with the current implementation of sample_ppc - but this can be fixed in a straightforward way, see the discussion on GitHub. But even if sample_ppc returned PPC samples corresponding to the samples from trace, one would still have to perform additional computations after sampling. A better way would be to allow something like this:

    fy_pred = gp.conditional('f_pred', Xnew=...)
    fz_pred = a * fy_pred + b
    samples = pm.sample_ppc(trace, vars=[fy_pred, fz_pred])

(currently this doesn't work), but if it were possible to somehow reuse the formula from the specification of z = ... above, that would be best. I'm not sure how difficult/sensible each of these points is, but at least something in this direction would be useful.

  • If the model needs different (e.g. shifted) X values for the observations of y and z, it becomes even less convenient. The only way I found is doing f = gp.prior('f', X=tt.concatenate([x, x + shift])[:, None], shape=2 * x.size).reshape((2, x.size)) and then f_y = f[0], f_z = a * f[1] + b. An intuitive f_y = gp.prior('f', X=x), f_z = a * gp.prior('f', X=x + shift) + b seems to mean something different. And of course, the same shift has to be applied explicitly when specifying gp.conditional for sampling, and a and b have to be used again in an entirely different place, after PPC sampling.

  • Oh, and I’m not even talking about using and combining two GPs…

It's easy to make a typo/mistake in one of the steps, as there is quite a bit of repetition, as you can see. Do you think it would be possible to implement something that could be used like this:

gp = pm.gp.Latent(...)
fy = gp.prior('fy', X=x)
fz = a * gp.prior('fz', X=x + shift) + b

trace = pm.sample()

fy_pred = fy.conditional('fy_pred', Xnew=xnew)
fz_pred = fz.conditional('fz_pred', Xnew=xnew)
samples = pm.sample_ppc(trace, vars=[fy_pred, fz_pred])
