# Bayesian model calibration with Gaussian Process

I am attempting to implement bayesian model calibration under the classical Kennedy-O’Hagan framework using PyMC3. Briefly, the setup is as follows:

I have n observations z_i of the form

z_i = f(x_i, \theta) + g(x_i) + e_i,

where x_i are known imputs and \theta are unknown calibration parameters and e_i are iid error terms. We also have m model evaluations y_j of the form

y_j = f(x^*_j, \theta_j),

where both x^*_j (different than x_i above) and \theta_j are known. Therefore, the data consists of all z_i and y_j. In the paper, Kennedy-O’Hagan model f, g using gaussian processes:

f \sim GP\{m_1(.,.), \Sigma_1[(.,.),(.,.)] \}

g \sim GP\{m_2(.), \Sigma_2[(.),(.)] \}.

Among other things, the goal is to get posterior samples for the unknow calibration iparameters \theta. As I said, I am trying to implement the solution using Python and PyMC3 in particular.

I have the following challenges:

• How to define the model in PyMC3 in the first place? I know how to do it ,when I have one GP for the whole data set or sum of GP for the whole data set. In this case, part of the data follows one model, another part follows another model.

• \theta is an input into the covariance function for f. There does not seem any streight forward way to do this in PyMC3 as well.

I will be very grateful for any tips or ideas how to solve the above mentioned problems!

If I understand correctly, you have x_i as input and two different GP evaluate on part of x_i which gives z_i and y_j right? You can partition x_i into two part and model two GP within a pm.Model() block. Moreover, you can specify \theta one and input them into each GP if they are the same for both GP.

Not sure I understand why you said \theta is not straight forward to model - are they regular covariance function parameters like lenght scale etc?

1 Like

I have a different x_j 's that give y_j. Sorry for the mistake, I corrected it in the innitial question. I have still hard time to see how to exactly define it within the pm.Model() block. The gaussian processes also depend on some parameters \gamma which are in the case of f shared across z_j and y_j. I guess my question is, if I do something like this:

'gp1 = pm.gp.Marginal(mean_func1, cov_func1)
gp2 = pm.gp.Marginal(mean_func2, cov_func2)
gp3 = gp1 + gp2

f_1 = gp1.marginal_likelihood(“f_1”, X_1, y_1, noise)
f_2 = gp3.marginal_likelihood(“f_2”, X_2, y_2, noise)

will the inference be done properly?

In the case of \theta, it is a vector parameter that is not the parameter of the covariance functions but it plays a role of x in the covariance matrix. i.e. I have a input (\theta, x_i) for each z_i (same \theta for each)

I am far from an expert on Pymc3 but does it make sense to stack your x_i and \theta values into one large X vector to pass into your GP? That way all of your x_i and \theta values are inputs.

Something like

import theano.tensor as tt

n = number_of_datapoints
theta1_vals = np.random.normal(loc=theta1_mu,scale=theta1_sd,size=n)
theta2_vals = np.random.normal(loc=theta2_mu,scale=theta2_sd,size=n)
.
.
.

thetas = zip(theta1_vals,theta2_vals,...)

y1 = [some_function(x,theta1,theta2) for theta1,theta2,... in thetas]

X = np.vstack([np.hstack([x,theta]) for theta in thetas])

with pm.Model() as model:
.
.
.

f1 = gp1.marginal_likelyhood('y',X = X, y=y1,noise=sigma_1)
.
.
.


Another thought, if you never need to vary x_i after you have created your training set maybe your gp only needs to be a function of \theta even though x_i was used in its creation. Could you just do:

import theano.tensor as tt

n = number_of_datapoints
theta1_vals = np.random.normal(loc=theta1_mu,scale=theta1_sd,size=n)
theta2_vals = np.random.normal(loc=theta2_mu,scale=theta2_sd,size=n)
.
.
.

thetas = zip(theta1_vals,theta2_vals,...)

y1 = [some_function(x,theta1,theta2) for theta1,theta2,... in thetas]

X = np.vstack(thetas)

with pm.Model() as model:
.
.
.

f1 = gp1.marginal_likelyhood('y',X = X, y=y1,noise=sigma_1)
.
.
.


Your gaussian process model would be len(y) long so you never need to use x_i when you call it. The first approach would let you make x_i uncertain though which I could see being helpful sometimes.

1 Like

You can run some simulation to see if it can recover the parameters. But it sounds like a valid use case to me.

Hmm, I am not sure how this solves the problem of need for something like:

with pm.Model() as model:
theta = pm.MvNormal("theta", mu = mean, cov = covariance)

cov_func = pm.gp.cov.ExpQuad(len(mu) + X.shape, lengthscale)

gp = pm.gp.Marginal(cov_func = cov_func)

#Now I need to perfrom something that stacks all (theta,x). Lets call this X1
f1 = gp.marginal_likelihood('y', X = X1, y = y, noise = sigma)


I do need to perform inference about unknown RV \theta and get its posterior distribution.

It looks like this works Thanks! However, I am still strugling with theano RV \theta as an input (not a parameter) into a cov function of a GP.

The cov function in PyMC3 accepts numpy array, theano tensor, or pymc3 RV as input. You can only do inference on the RVs in general (with some twist you can do it on theano tensor as well)

Maybe I’m misunderstanding (I haven’t read the Kennedy paper) but your \theta values are your parameter values in your more complicated function correct? If you never vary x_i but treat it as known, why do you need it as an input to your gaussian process? Isn’t that information already included in your y vector? i.e.

y = np.array([some_function(x,thetas) for thetas in theta_vals]) #(A n x len(x) matrix)


So now your gp is really only a function of your \theta values which would then be your X?

Please disregard if I’m still not getting it.

I am sorry, I reallized that I might not have been the most clear. I will try to elaborate. I am trying to model something like this

z_i = f(x_i, \theta) + e_i
y_j = f(x^*_j, \theta^*_j) +e_j

where

f \sim GP\{m(.,.), \Sigma[(.,.)(.,.)]\}

with

m(x,\theta) = x^T \theta

and

\Sigma[(x,\theta)(x',\theta')] = e^{-\sum (\frac{\theta_i - \theta'_i}{\gamma_i})^2 -\sum (\frac{x_j - x'_j}{\eta_i})^2 }.

And my observed responses are D = (z_1, \dots, z_n, y_1, \dots, y_m) with covariates \{(x_1, \theta), \dots, (x_n,\theta), (x^*_1, \theta^*_1), \dots, (x^*_m, \theta^*_m)\}. Where I have x_i, x^*_j, \theta^*_j as known fixed values (it is part of my data) and unknow calibration parameters \theta.

My goal is to do inference abut all \gamma_i, \eta_j, \theta and get their posterior distributions. So that is why I need to use unknown \theta as an input into a covariance function, and that is what I am struggling with.

Oh okay. I’m afraid I’m well out of my depth there then. I will be watching this thread with interest to see if you arrive at a solution.

I found a link where someone was working on your same problem with pymc3 and sklearn:
https://github.com/pymc-devs/pymc3/issues/1574

1 Like

@NateAM thanks! This seems to be helpful. It does look rather like a work-around though. I am still wondering if there is more straightforward solution with PyMC3. Any suggestions @junpenglao?

That post is a bit dated - we now implemented a full feature GP class which makes these problem much easier to model.
I think it would help if you could write down the generation process and simulation some data. There are many ways to model what you describe above, and if there are difficulties using the GP API directly, there are usually ways to code the model differently

Here is a little toy simulation of what I am trying to code:

import numpy as np
import pymc3 as pm

np.random.seed(123)
def black_box(input):
'''Represents the expensive model f which I am trying to calibrate'''
l_scale = np.array([1, 1])
beta = 1

cov_mat = cov_fcn(input).eval()
mean = np.repeat(beta, len(input)) #For simplicity, the mean is constant

return np.random.multivariate_normal(mean, cov_mat).flatten()

x1 = np.random.randn(10)
x2 = np.random.randn(15)
theta_star = np.random.randn(15)
theta_truth = 2

input_1 = np.array([[x_1, theta_truth] for x_1 in x1])
input_2 = np.array([[x2[i], theta_star[i]] for i in range(len(x2))])

#y1 represents observations for known imputs x1 and unknown calibration parameters
y1 = black_box(input_1)
#y2 represents model runs for known x2 and theta_star
y2 = black_box(input_2)

with pm.Model() as toy_model:

#Priors
theta = pm.Normal("theta", mu = 0, sd = 1)
beta = pm.Normal("beta", mu = 0, sd = 1)
l_scale = pm.Gamma("l_scale", alpha= 1, beta = 5, shape = 2)
σ = pm.HalfCauchy("σ", beta=5)

#Mean and cov of the GP
mean_fcn = pm.gp.mean.Constant(beta)

#This chunk is where it fails###################
input_1 = np.array([[x_i, theta] for x_i in x1])
################################################

gp1 = pm.gp.Marginal(mean_func = mean_fcn, cov_func = cov_fcn)
gp2 = pm.gp.Marginal(mean_func = mean_fcn, cov_func = cov_fcn)

y_1 = gp2.marginal_likelihood("y1", X=input_1, y=y1, noise=σ)
y_2 = gp2.marginal_likelihood("y2", X=input_2, y=y2, noise=σ)


This gives me the following error:
AsTensorError: ('Cannot convert [[-1.0856306033005612 theta]\n [0.9973454465835858 theta]\n [0.28297849805199204 theta]\n [-1.506294713918092 theta]\n [-0.5786002519685364 theta]\n [1.651436537097151 theta]\n [-2.426679243393074 theta]\n [-0.42891262885617726 theta]\n [1.265936258705534 theta]\n [-0.8667404022651017 theta]] to TensorType', <class 'numpy.ndarray'>)

Any Idea how to make PyMC3 accept something like:

theta = pm.Normal("theta", mu = 0, sd = 1)
input_1 = np.array([[x_i, theta] for x_i in x1])

This is a use case where PyMC3 should really work well! The GP functions can definitely handle this. You can find a somewhat similar example here, where a flat prior is placed over the inducing point locations, which are input to a covariances function.

Is there a different theta for each x_i? or is it the same theta? (is it a scalar or a column vector?) If it’s a scalar, maybe try something like (you’ll have to double check these shapes)

theta = pm.Normal("theta", mu=0.0, sd=1.0)
theta_col = tt.alloc(theta, len(x1))
input_1 = tt.concatenate((x1[:,None], theta_col[:,None], axis=1)


Also, if by chance what you’re working on is open source / online id love to see a link!

2 Likes

Yep, the key is to stack the theano scaler with a numpy array. For example another way to do is

import theano
import theano.tensor as tt
x1shared = theano.shared(x1[:, None])
with pm.Model():
...
input_1 = tt.concatenate([x1shared, tt.tile(theta, len(x1), ndim=2).T], axis=1)
...

2 Likes

@bwengals and @junpenglao thank you! This actually works well. @bwengals the theta is same for each x_i so what @junpenglao wrote works perfect! In general \theta \in \mathbb{R}^d, but I have already figured out how to modify the code so it works for a vector as well.

The mean function of my GP f is also a little bit peculiar. It looks like this:
m(x, \theta) = c(x) + \theta^Tv(x),
where c(x) is a constant and v(x) is a vector in \mathbb{R}^d.

I am implementing this by adding column to input_2, since I can calculate it explicitly for [x2, theta_star] and then having a linear mean function for gp2 with coeff = [1, 0, ...,0] and intercept = 0. I will also let the covariance function act only on the appropriate dimensions.

In the case of gp1, I am planning to include v(x)^T as rows in the design matrix x1 and then have a linear mean function with coeff = tt.concatenate([theta, [0,...,0]]) again with covariance matrix acting on particular dimensions.

2 Likes

@kejzlarv @bwengals @junpenglao I’ve studied this post pretty closely trying to implement the same model as @kejzlarv did. I found good success on a basic problem with similar code to @kejzlarv but I think a slight modification is required to match the theory from the original paper.

First a summary of @kejzlarv has done:

We have some observations of data in y_i with inputs x_i that we believe we can fit with some function f(x_i,\theta) up to a discrepancy function g(x_i), this data is subject to noise e_i. We assume that

f(x_i,\theta) \sim \mathcal{GP}(m_1,c_1(\{(\cdot,\cdot),(\cdot,\cdot)\})) \\ g(x_i) \sim \mathcal{GP}(m_2, c_1(\cdot,\cdot))

If we stop here, then the posterior would be something like

p(\theta, \psi | X,y) \propto p(\theta)p(\psi) p(y|X,\theta,\psi)

where p(y|X,\theta,\psi) = f(x_i,\theta) + g(x_i) \theta are the parameters we are trying to calibrate and \psi are the hyperparameters of the Gaussian Processes. We can then sample this using pycm3’s awesome sampling framework to get the posterior and then marginalize to get the posterior of \theta.

This is where I think some additional work is required. The purpose of this calibration model is to calibrate a computer model to “real” data. To this end, y and X are actually the vectors y = [y_n,y_e]^T and X = [(x_n,t^*), (y_e,\theta)]^T where the subscripts “n” and “e” denote numerical and experimental respectively. We do not believe that the discrepancy function should apply to the experimental data, and we do not believe that the numerical data is subject to noise - simulations are almost always deterministic. If we then write the Gaussian Process covariance and mean for the concatenated data, it should be something like (taking h = f + g)

h \sim \mathcal{GP}(m_3,c_3(\{(\cdot,\cdot),(\cdot,\cdot)\}))\\ m_3 = [m_1 + m_2, m_1]^T \\ c_3(\{(\cdot,\cdot),(\cdot,\cdot)\}) = \begin{bmatrix} c_1(\{(x_n,t),(x_n,t)\}) & c_1(\{(x_n,t),(x_e,\theta)\})^T \\ c_1(\{(x_n,t),(x_e,\theta)\}) & c_1(\{(x_e,\theta),(x_e,\theta)\}) + c_2(\{(x_e,\theta),(x_e,\theta)\}) + \Sigma_{e} \end{bmatrix}

The important point here being the structure of the covariance function. This has proven quite difficult to code into pymc3, with my attempt shown here:

class KOH_Kernel(Covariance):

"""
Additive model of Kennedy & O'Hagen covariance kernel
that operates on X assuming that X is:
(N + n) x (M + m)
n - length of the data
M - length of the simulation input vector
m - length of the simulation calibration vector

c_eta is the covariance of the eta function
c_delta is the covariance of the discrepency delta function
and noise is the covariance of data, typically specified as a diagonal

theta is the unknown parameter vector
"""

def __init__(self,input_dim: int,
N: int,
M: int,
c_eta: Covariance,
c_delta: Covariance,
noise: Covariance,
theta:tt.tensor) -> None:

super().__init__(input_dim)
self.N = N
self.M = M
self.theta = theta
self.c_eta = c_eta
self._c_delta = c_delta
self._noise = noise
self._conditional = False

def _my_slice(self,X,Xs):

X = tt.as_tensor_variable(X)
if Xs is not None:
Xs = tt.as_tensor_variable(Xs)

return X, Xs

def c_delta(self,X:tt.tensor,
diag = False) -> tt.tensor:
"""
get the delta part of the covariance matrix
"""
if not diag:
_c_delta = tt.zeros((X.shape,X.shape))
tt.subtensor.set_subtensor(_c_delta[self.N:,self.N:],
self._c_delta(X[self.N:,:]) +self._noise(X[self.N:,:]))
return _c_delta
else:
return self._c_delta(X[self.N,:],diag = True) +\
self._noise(X[self.N,:],diag = True)

def _assign_calibration_parameter(self,X: tt.tensor) -> tt.tensor:
"""
asigns the calibration vector theta to the inputs X
"""
return tt.subtensor.set_subtensor(X[self.N:,self.M:],self.theta)

def diag(self,X:np.ndarray) -> tt.tensor:
"""
method for returning the diagonal of the covariance with significantly less
associated computational cost than computing the full covariance matrix
and taking the diagonal
"""
return tt.concatenate([self.c_eta(X,diag = True),
self.c_delta(X,diag = True)],axis = 0)

def full(self,X:np.ndarray,
Xs: np.ndarray) -> tt.tensor:

"""
the method full is required when inheriting form the Covariance class in pymc3
and is called by a __call__ method in the base class
"""

tX,_ = self._my_slice(X,Xs)
if not self._conditional:
tX = self._assign_calibration_parameter(tX)
return self.c_eta(X) + self.c_delta(X)
else:
return self.c_eta(X)


But using this method I get a lot of divergences and very slow convergence in the sampling even at target acceptance of 0.95. Additionally, I run into issues with the prediction, namely the “input dimension mismatch issue” where the prediction step appears to want my predictive values to be the same shape as my input values. (link to someone experiencing the same issue: "Input dimension mis-match" in basic model?). When I pass a 1D vector for y instead of a nx1 vector, the Cholesky decomposition fails.

So my question is this: Is there something in my covariance function that I am doing incorrectly in the covariance function implementation that pymc3 is not happy with, causing the slow sampling and divergence? Is there something wrong with the covariance function implementation that is causing the failure in the Cholesky decomposition?