 # Gaussian Process Regression with hyperpriors

My data:

I have a matrix of `N` predictors, with `t` observations for the model training:
`X (t x N)`
I have a target vector (the thing I would like to predict):
`y (t,)`
I have a vector of test cases for each predictor, in order to perform the forecast:
`X_new (1 x N)`

I would like to perform a multivariate Gaussian Process regression with hyperpriors `p(θ)`, whereby the optimum hyperparameters `θ = (ℓ, a, σ)` (see code below) are first determined by Metropolis MCMC, and then the optimum regression weight parameters `β` are determined (`y = Xβ + ε`) in order to perform a forecast `y_new = X_new β`. However, I'm not quite sure if I am implementing this correctly in PyMC3…

So my understanding is that parameters `μ` and `Σ` (see code below) correspond to the mean and covariance of the model prior `p(β)`. I’m wondering, does my code below just take one random sample from the defined distributions of each of the hyperparameters and then perform MCMC sampling on the model posterior `p(β | y, X, θ)` to get the regression weights? Or does it perform MCMC sampling on the posterior of the hyperparameters `p(θ | y, X)` and the model posterior `p(β | y, X, θ)`? If it’s the former, how can I change my code to implement hyperpriors and optimise them with Metropolis MCMC, and finally perform a prediction of `y`?

Side note: the priors on my hyperparameters would be weakly informative. All I know is that they should all be > 0.

Thanks.

```
import pymc3 as pm

with pm.Model() as model:
    # hyperparameter priors (all constrained > 0)
    ℓ = pm.Gamma('ℓ', alpha=1, beta=10, shape=X.shape[1])  # one length-scale per predictor
    a = pm.HalfCauchy('a', beta=10)
    σ = pm.HalfCauchy('σ', beta=10)

    μ = pm.gp.mean.Zero()
    # Σ is assumed here to be a squared-exponential kernel scaled by the
    # signal variance a (it was not defined in my original snippet)
    Σ = a * pm.gp.cov.ExpQuad(input_dim=X.shape[1], ls=ℓ)

    gp = pm.gp.Marginal(mean_func=μ, cov_func=Σ)
    y_ = gp.marginal_likelihood('y_', X=X, y=y, noise=σ)

    MAP = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(1000, step=step, start=MAP, random_seed=1, progressbar=True)
    func = gp.conditional('func', Xnew=X_new)
    fmean, fvar = gp.predict(Xnew=X_new, point=MAP)
```

I think I figured it out in the end with a bit of playing around:

```
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    # hyperpriors
    ℓ_mu = pm.Normal('ℓ_mu', mu=-10, sd=2)
    ℓ_sd = pm.Gamma('ℓ_sd', alpha=10, beta=10)
    a_mu = pm.Normal('a_mu', mu=-1, sd=2)
    a_sd = pm.Gamma('a_sd', alpha=5, beta=10)
    σ_mu = pm.Normal('σ_mu', mu=-1, sd=2)
    σ_sd = pm.Gamma('σ_sd', alpha=5, beta=10)

    # priors, log-parameterised so that ℓ, a and σ stay > 0
    logℓ = pm.Normal('log(ℓ)', mu=ℓ_mu, sd=ℓ_sd)
    loga = pm.Normal('log(a)', mu=a_mu, sd=a_sd)
    logσ = pm.Normal('log(σ)', mu=σ_mu, sd=σ_sd)
    ℓ = pm.Deterministic('ℓ', tt.exp(logℓ))  # kernel length-scale
    a = pm.Deterministic('a', tt.exp(loga))  # signal variance
    σ = pm.Deterministic('σ', tt.exp(logσ))  # noise variance

    μ = pm.gp.mean.Zero()
    # Σ is again assumed to be a squared-exponential kernel (not defined above)
    Σ = a * pm.gp.cov.ExpQuad(input_dim=X.shape[1], ls=ℓ)

    gp = pm.gp.Marginal(mean_func=μ, cov_func=Σ)
    y_ = gp.marginal_likelihood('y_', X=X, y=y, noise=σ)
    trace = pm.sample(draws=1000, chains=5, step=pm.Metropolis(), random_seed=1, progressbar=False)
    func = gp.conditional('func', Xnew=X_new, pred_noise=True)
    fmean, fvar = gp.predict(X_new, point=trace[-1], pred_noise=True)
```
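If you want full posterior-predictive draws at `X_new` rather than just the mean and variance at a single point, something like this should work inside the same model context (a minimal sketch; in older PyMC3 releases `sample_posterior_predictive` is called `sample_ppc` and takes `vars=` instead of `var_names=`):

```
with model:
    # draw posterior-predictive samples of 'func' (defined by gp.conditional above)
    pred = pm.sample_posterior_predictive(trace, var_names=['func'], samples=200)

y_new_samples = pred['func']        # (200, 1) array of forecasts at X_new
y_new = y_new_samples.mean(axis=0)  # Monte-Carlo forecast mean
```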

You could maybe use a sklearn dataset, e.g. Boston house prices.

The data I have been using is zero-mean and typically has N > t (more predictors than observations). I've made the Boston dataset here zero-mean also; however, for the Boston data t >> N, so I'm not too sure how the GP will perform. You may also need to adjust the hyperpriors (particularly the one on the characteristic length-scale), as the parameters from my last comment are probably a bit dataset-specific.

```
import numpy as np
from sklearn.datasets import load_boston

boston = load_boston()
t = boston.data.shape[0]  # number of observations

# create zero-mean predictors by removing a linear trend from each column
X = np.zeros(boston.data.shape)
for N in range(boston.data.shape[1]):
    p = np.polyfit(np.arange(t), boston.data[:, N], 1)
    line = np.polyval(p, np.arange(t))
    X[:, N] = boston.data[:, N] - line

X_new = X[-1, :].reshape(1, X.shape[1])  # test case: the held-out last row
X = X[:-1, :]                            # predictors

# create zero-mean target in the same way
p = np.polyfit(np.arange(t), boston.target, 1)
line = np.polyval(p, np.arange(t))
y = boston.target - line
truth = y[-1]
y = y[:-1]  # target
```
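As a quick sanity check before plugging these into the model above, the shapes should match the `X (t x N)`, `y (t,)`, `X_new (1 x N)` layout from the top of the thread:

```
print(X.shape, y.shape, X_new.shape)
# (505, 13) (505,) (1, 13) for the Boston data
```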

I am currently using NUTS for sampling and it's a lot slower on my model (1d 22h 15m and running). To be fair, my model has highly correlated 'explanatory' variables, and it uses a multivariate / multi-dimensional Gaussian process.

I have a question regarding Metropolis in Gaussian processes. Are you sure Metropolis works at least as well as NUTS? I would like to test it, but I don't want to spend another 2-3 days for unusable results. I would like to know your take on this.

I would strongly suggest that you NOT trust the results from Metropolis. It often gives a wrong answer in high dimensions. However, if you are happy with a biased result and only want to explore one specific mode, I guess you can use Metropolis and initialise your chain in a very specific region of the parameter space.
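If you do want to try Metropolis anyway, at least run several chains and check the convergence diagnostics before trusting anything (a minimal sketch, assuming your `model` context from above; the exact `pm.summary` column names, e.g. `Rhat`/`n_eff` versus `r_hat`/`ess_mean`, depend on the PyMC3 version):

```
with model:
    trace_met = pm.sample(5000, step=pm.Metropolis(), chains=4, random_seed=1)

print(pm.summary(trace_met))
# look for Rhat close to 1.0 and a reasonably large effective sample size;
# Rhat well above 1 means the chains disagree and the result is not trustworthy
```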

Thank you. I have one more question. I found out here that I may need to use variational inference methods if I have too many data points.
I have a model with 12 explanatory variables and one dependent variable of length 550. I used a simple squared-exponential covariance function with added white noise and a zero mean function. As explained earlier, my Gaussian process model is too slow with NUTS sampling (2d+ and running).

I tried using ADVI with init='advi+adapt_diag'; it is significantly faster initially compared to 'jitter+adapt_diag', but I cannot avoid the same slow sampling later. Do you know any variational inference examples other than the one given here?
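For reference, I guess full ADVI (rather than only using it for NUTS initialisation) would look something like this (a minimal sketch, assuming `pm.fit` with the same `model` as earlier in the thread):

```
with model:
    approx = pm.fit(n=50000, method='advi')  # mean-field ADVI
    trace_vi = approx.sample(1000)           # draws from the fitted approximation

# trace_vi can then be used much like an MCMC trace (e.g. passed to gp.predict),
# though ADVI typically underestimates the posterior variance
```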