Gaussian Process Regression with hyperpriors

My data:

I have a matrix of N predictors, with t observations for the model training:
X (t x N)
I have a target vector (the thing I would like to predict):
y (t,)
I have a vector of test cases for each predictor, in order to perform the forecast:
X_new (1 x N)

I would like to perform a multivariate Gaussian Process regression with hyperpriors p(θ), whereby optimum hyperparameters (θ = ℓ, a, σ) (see code below) are first determined by Metropolis MCMC and then the optimum regression weight parameters β are determined (y = βX + ε) in order to perform a forecast y_new = βX_new. However I’m not quite sure if I am implementing this correctly in pyMC3…

So my understanding is that parameters μ and Σ (see code below) correspond to the mean and covariance of the model prior p(β). I’m wondering, does my code below just take one random sample from the defined distributions of each of the hyperparameters and then perform MCMC sampling on the model posterior p(β | y, X, θ) to get the regression weights? Or does it perform MCMC sampling on the posterior of the hyperparameters p(θ | y, X) and the model posterior p(β | y, X, θ)? If it’s the former, how can I change my code to implement hyperpriors and optimise them with Metropolis MCMC, and finally perform a prediction of y?

Side note - the prior on my hyperparameters would be weakly informative. All I know is that they should all be > 0.


with pm.Model() as model:
    ℓ = pm.Gamma('ℓ', alpha=1, beta=10, shape=X.shape[1])
    a = pm.HalfCauchy('a', beta=10)
    σ = pm.HalfCauchy("σ", beta=10)

    μ =
    Σ = a *,active_dims=np.arange(X.shape[1]),ls=ℓ)

    gp =μ, cov_func=Σ)
    y_ = gp.marginal_likelihood('y_',X=X, y=y, noise=σ)

    MAP = pm.find_MAP()
    method = pm.Metropolis()
    trace = pm.sample(1000, method, MAP, random_seed=1, progressbar=True)
    func = gp.conditional('func', Xnew=X_new)
    fmean, fvar = gp.predict(Xnew=X_new, point=MAP)

Think I figured it out in the end with a bit of playing around:

 with pm.Model() as model:
    # hyperpriors
    ℓ_mu = pm.Normal('ℓ_mu', mu=-10, sd=2)
    ℓ_sd = pm.Gamma('ℓ_sd', alpha=10, beta=10)
    a_mu = pm.Normal('a_mu', mu=-1, sd=2)
    a_sd = pm.Gamma('a_sd', alpha=5, beta=10)
    σ_mu = pm.Normal('σ_mu', mu=-1, sd=2)
    σ_sd = pm.Gamma('σ_sd', alpha=5, beta=10)

    # priors
    logℓ = pm.Normal('log(ℓ)', mu=ℓ_mu, sd=ℓ_sd) 
    loga = pm.Normal('log(a)', mu=a_mu, sd=a_sd)
    logσ = pm.Normal('log(σ)', mu=σ_mu, sd=σ_sd)
    ℓ = pm.Deterministic('ℓ', tt.exp(logℓ)) #kernel length-scale
    a = pm.Deterministic('a', tt.exp(loga)) #signal variance
    σ = pm.Deterministic('σ', tt.exp(logσ)) #noise variance
    μ =
    Σ = a *[1],active_dims=np.arange(X.shape[1]),ls=ℓ)

    gp =μ, cov_func=Σ)
    y_ = gp.marginal_likelihood('y_',X=X, y=y, noise=σ)
    trace = pm.sample(draws=1000, chains=5, step=pm.Metropolis(), random_seed=1, progressbar=False)
    func = gp.conditional('func', Xnew=X_new, pred_noise=True)
    fmean, fvar = gp.predict(X_new, point=trace[-1], pred_noise=True)
1 Like

Could please add fake data? I want to run your code.

You could use a sklearn dataset maybe, e.g Boston house prices.

The data I have been using is zero-mean and typically N>n. I’ve made the Boston dataset here zero-mean also, however for the Boston data n>>N so I’m not to sure how the GP will perform. You may also need to adjust the hyper-priors (particularly the characteristic length scale parameter) as the parameters from my last comment are probably a bit dataset specific

from sklearn.datasets import load_boston
boston = load_boston()
X = np.zeros((
#create zero-mean predictors
for N in range([1]):
    p = np.polyfit(np.arange([0]),[:,N],1)
    line = np.polyval(p, np.arange([0]))
    X[:,N] =[:,N] - line
X_new = X[-1,:].reshape(1,X.shape[1]) #test case
X = X[:-2,:] #predictors
#create zero-mean target
p = np.polyfit(np.arange([0]),,1)
line = np.polyval(p, np.arange([0]))
y = - line
truth = y[-1]
y = y[:-2] #target

I am currently using NUTS for sampling and it’s a lot slower on my model (1d, 22h 15m and running). To be fair, my model has highly correlated ‘explanatory’ variables. My model is using a multi-variable / multi-dimensional Gaussian process.

I have a question regarding ‘Metropolis’ in Gaussian process. Are you sure ‘Metropolis’ works at least as good as NUTS?. I would like to test it but i don’t want to spend another 2-3 days for unusable results. I would like to know your take on this.

I would strongly suggest you NOT to trust the results from metropolis. It offen gives a wrong answer in high dimensions. However, if you are happy with biased result and only want to explore one specific mode, I guess you can use metropolis and initialize your chain in a very specific region in the parameter space.

Thank you. I have one more question. I found out here that i may need to use variational inference methods if i have too many a data points.
I have a model with 12 explanatory variables and one dependent variable of length 550. I used a simple squared exponential covariance function with added white noise and zero mean to model. As explained earlier my Gaussian process model is too slow with NUTS sampling (2d + and running).

I tried using ADVI with init=‘advi+adapt_diag’, it is significantly faster initially compared to ‘jitter+adapt_diag’ but i cannot avoid the same slow sampling later. Do you know any variational inference examples other than the one given here.