Gaussian Process Regression with hyperpriors

wjgregory92 · July 14, 2019, 11:21am

My data:

I have a matrix of N predictors, with t observations for the model training:
X (t x N)
I have a target vector (the thing I would like to predict):
y (t,)
I have a vector of test cases for each predictor, in order to perform the forecast:
X_new (1 x N)

I would like to perform a multivariate Gaussian Process regression with hyperpriors p(θ), whereby optimum hyperparameters (θ = ℓ, a, σ) (see code below) are first determined by Metropolis MCMC and then the optimum regression weight parameters β are determined (y = βX + ε) in order to perform a forecast y_new = βX_new. However I’m not quite sure if I am implementing this correctly in pyMC3…

So my understanding is that parameters μ and Σ (see code below) correspond to the mean and covariance of the model prior p(β). I’m wondering, does my code below just take one random sample from the defined distributions of each of the hyperparameters and then perform MCMC sampling on the model posterior p(β | y, X, θ) to get the regression weights? Or does it perform MCMC sampling on the posterior of the hyperparameters p(θ | y, X) and the model posterior p(β | y, X, θ)? If it’s the former, how can I change my code to implement hyperpriors and optimise them with Metropolis MCMC, and finally perform a prediction of y?

Side note - the prior on my hyperparameters would be weakly informative. All I know is that they should all be > 0.

Thanks.

with pm.Model() as model:
    ℓ = pm.Gamma('ℓ', alpha=1, beta=10, shape=X.shape[1])
    a = pm.HalfCauchy('a', beta=10)
    σ = pm.HalfCauchy("σ", beta=10)

    μ = pm.gp.mean.Zero()
    Σ = a * pm.gp.cov.ExpQuad(input_dim=2,active_dims=np.arange(X.shape[1]),ls=ℓ)

    gp = pm.gp.Marginal(mean_func=μ, cov_func=Σ)
    y_ = gp.marginal_likelihood('y_',X=X, y=y, noise=σ)

    MAP = pm.find_MAP()
    method = pm.Metropolis()
    trace = pm.sample(1000, method, MAP, random_seed=1, progressbar=True)
    func = gp.conditional('func', Xnew=X_new)
    fmean, fvar = gp.predict(Xnew=X_new, point=MAP)

wjgregory92 · July 24, 2019, 9:50am

Think I figured it out in the end with a bit of playing around:

 with pm.Model() as model:
    # hyperpriors
    ℓ_mu = pm.Normal('ℓ_mu', mu=-10, sd=2)
    ℓ_sd = pm.Gamma('ℓ_sd', alpha=10, beta=10)
    a_mu = pm.Normal('a_mu', mu=-1, sd=2)
    a_sd = pm.Gamma('a_sd', alpha=5, beta=10)
    σ_mu = pm.Normal('σ_mu', mu=-1, sd=2)
    σ_sd = pm.Gamma('σ_sd', alpha=5, beta=10)

    # priors
    logℓ = pm.Normal('log(ℓ)', mu=ℓ_mu, sd=ℓ_sd) 
    loga = pm.Normal('log(a)', mu=a_mu, sd=a_sd)
    logσ = pm.Normal('log(σ)', mu=σ_mu, sd=σ_sd)
    ℓ = pm.Deterministic('ℓ', tt.exp(logℓ)) #kernel length-scale
    a = pm.Deterministic('a', tt.exp(loga)) #signal variance
    σ = pm.Deterministic('σ', tt.exp(logσ)) #noise variance
    μ = pm.gp.mean.Zero()
    Σ = a * pm.gp.cov.ExpQuad(input_dim=X.shape[1],active_dims=np.arange(X.shape[1]),ls=ℓ)

    gp = pm.gp.Marginal(mean_func=μ, cov_func=Σ)
    y_ = gp.marginal_likelihood('y_',X=X, y=y, noise=σ)
    trace = pm.sample(draws=1000, chains=5, step=pm.Metropolis(), random_seed=1, progressbar=False)
    func = gp.conditional('func', Xnew=X_new, pred_noise=True)
    fmean, fvar = gp.predict(X_new, point=trace[-1], pred_noise=True)

rosgori · July 24, 2019, 11:51pm

Could please add fake data? I want to run your code.

wjgregory92 · July 25, 2019, 10:11am

You could use a sklearn dataset maybe, e.g Boston house prices.

The data I have been using is zero-mean and typically N>n. I’ve made the Boston dataset here zero-mean also, however for the Boston data n>>N so I’m not to sure how the GP will perform. You may also need to adjust the hyper-priors (particularly the characteristic length scale parameter) as the parameters from my last comment are probably a bit dataset specific

from sklearn.datasets import load_boston
boston = load_boston()
X = np.zeros((boston.data.shape))
#create zero-mean predictors
for N in range(boston.data.shape[1]):
    p = np.polyfit(np.arange(boston.data.shape[0]),boston.data[:,N],1)
    line = np.polyval(p, np.arange(boston.data.shape[0]))
    X[:,N] = boston.data[:,N] - line
X_new = X[-1,:].reshape(1,X.shape[1]) #test case
X = X[:-2,:] #predictors
#create zero-mean target
p = np.polyfit(np.arange(boston.data.shape[0]),boston.target,1)
line = np.polyval(p, np.arange(boston.data.shape[0]))
y = boston.target - line
truth = y[-1]
y = y[:-2] #target

indian312 · August 27, 2019, 9:38pm

I am currently using NUTS for sampling and it’s a lot slower on my model (1d, 22h 15m and running). To be fair, my model has highly correlated ‘explanatory’ variables. My model is using a multi-variable / multi-dimensional Gaussian process.

I have a question regarding ‘Metropolis’ in Gaussian process. Are you sure ‘Metropolis’ works at least as good as NUTS?. I would like to test it but i don’t want to spend another 2-3 days for unusable results. I would like to know your take on this.

junpenglao · August 28, 2019, 6:10am

I would strongly suggest you NOT to trust the results from metropolis. It offen gives a wrong answer in high dimensions. However, if you are happy with biased result and only want to explore one specific mode, I guess you can use metropolis and initialize your chain in a very specific region in the parameter space.

indian312 · August 28, 2019, 9:39pm

Thank you. I have one more question. I found out here that i may need to use variational inference methods if i have too many a data points.
I have a model with 12 explanatory variables and one dependent variable of length 550. I used a simple squared exponential covariance function with added white noise and zero mean to model. As explained earlier my Gaussian process model is too slow with NUTS sampling (2d + and running).

I tried using ADVI with init=‘advi+adapt_diag’, it is significantly faster initially compared to ‘jitter+adapt_diag’ but i cannot avoid the same slow sampling later. Do you know any variational inference examples other than the one given here.

Topic		Replies	Views
Gaussian Processes: Sampling kernel hyperparameters? Questions	2	859	January 15, 2021
How to put a restriction on the value of hyper-parameters in Gaussian process Questions	4	837	October 30, 2020
What is the best way to define a multivariate regression in the context of hierarchical bayesian models? Questions	4	745	September 9, 2021
Multivariate Gaussian Process Classification (GPC) and Prediction Questions	2	893	February 23, 2020
Implementing ARD Regression Questions	9	1257	May 30, 2018

Gaussian Process Regression with hyperpriors

Related topics