Bayesian model calibration with Gaussian Process

@kejzlarv @bwengals @junpenglao I’ve studied this post pretty closely trying to implement the same model as @kejzlarv did. I found good success on a basic problem with similar code to @kejzlarv but I think a slight modification is required to match the theory from the original paper.

First a summary of @kejzlarv has done:

We have some observations of data in y_i with inputs x_i that we believe we can fit with some function f(x_i,\theta) up to a discrepancy function g(x_i), this data is subject to noise e_i. We assume that

f(x_i,\theta) \sim \mathcal{GP}(m_1,c_1(\{(\cdot,\cdot),(\cdot,\cdot)\})) \\ g(x_i) \sim \mathcal{GP}(m_2, c_1(\cdot,\cdot))

If we stop here, then the posterior would be something like

p(\theta, \psi | X,y) \propto p(\theta)p(\psi) p(y|X,\theta,\psi)

where p(y|X,\theta,\psi) = f(x_i,\theta) + g(x_i) \theta are the parameters we are trying to calibrate and \psi are the hyperparameters of the Gaussian Processes. We can then sample this using pycm3’s awesome sampling framework to get the posterior and then marginalize to get the posterior of \theta.

This is where I think some additional work is required. The purpose of this calibration model is to calibrate a computer model to “real” data. To this end, y and X are actually the vectors y = [y_n,y_e]^T and X = [(x_n,t^*), (y_e,\theta)]^T where the subscripts “n” and “e” denote numerical and experimental respectively. We do not believe that the discrepancy function should apply to the experimental data, and we do not believe that the numerical data is subject to noise - simulations are almost always deterministic. If we then write the Gaussian Process covariance and mean for the concatenated data, it should be something like (taking h = f + g)

h \sim \mathcal{GP}(m_3,c_3(\{(\cdot,\cdot),(\cdot,\cdot)\}))\\ m_3 = [m_1 + m_2, m_1]^T \\ c_3(\{(\cdot,\cdot),(\cdot,\cdot)\}) = \begin{bmatrix} c_1(\{(x_n,t),(x_n,t)\}) & c_1(\{(x_n,t),(x_e,\theta)\})^T \\ c_1(\{(x_n,t),(x_e,\theta)\}) & c_1(\{(x_e,\theta),(x_e,\theta)\}) + c_2(\{(x_e,\theta),(x_e,\theta)\}) + \Sigma_{e} \end{bmatrix}

The important point here being the structure of the covariance function. This has proven quite difficult to code into pymc3, with my attempt shown here:

class KOH_Kernel(Covariance):
    
    """
    Additive model of Kennedy & O'Hagen covariance kernel
    that operates on X assuming that X is:
    (N + n) x (M + m)
    n - length of the data
    M - length of the simulation input vector
    m - length of the simulation calibration vector
    
    c_eta is the covariance of the eta function
    c_delta is the covariance of the discrepency delta function
    and noise is the covariance of data, typically specified as a diagonal
    

    theta is the unknown parameter vector
    """
    
    def __init__(self,input_dim: int,
                      N: int,
                      M: int,
                      c_eta: Covariance,
                      c_delta: Covariance,
                      noise: Covariance,
                      theta:tt.tensor) -> None:
        
        super().__init__(input_dim)
        self.N = N
        self.M = M
        self.theta = theta
        self.c_eta = c_eta
        self._c_delta = c_delta
        self._noise = noise
        self._conditional = False

    def _my_slice(self,X,Xs):

        X = tt.as_tensor_variable(X)
        if Xs is not None:
            Xs = tt.as_tensor_variable(Xs)
        
        return X, Xs
    
    def c_delta(self,X:tt.tensor,
                     diag = False) -> tt.tensor:
        """
        get the delta part of the covariance matrix
        """
        if not diag:
            _c_delta = tt.zeros((X.shape[0],X.shape[0]))
            tt.subtensor.set_subtensor(_c_delta[self.N:,self.N:],
                                    self._c_delta(X[self.N:,:]) +self._noise(X[self.N:,:]))
            return _c_delta
        else:
            return self._c_delta(X[self.N,:],diag = True) +\
                   self._noise(X[self.N,:],diag = True)
        
    def _assign_calibration_parameter(self,X: tt.tensor) -> tt.tensor:
        """
        asigns the calibration vector theta to the inputs X
        """
        return tt.subtensor.set_subtensor(X[self.N:,self.M:],self.theta)
    
    
    def diag(self,X:np.ndarray) -> tt.tensor:
        """
        method for returning the diagonal of the covariance with significantly less
        associated computational cost than computing the full covariance matrix
        and taking the diagonal
        """
        return tt.concatenate([self.c_eta(X,diag = True),
                               self.c_delta(X,diag = True)],axis = 0)

    def full(self,X:np.ndarray,
                  Xs: np.ndarray) -> tt.tensor:
        
        """
        the method full is required when inheriting form the Covariance class in pymc3
        and is called by a __call__ method in the base class
        """
        
        tX,_ = self._my_slice(X,Xs)
        if not self._conditional:
            tX = self._assign_calibration_parameter(tX)
            return self.c_eta(X) + self.c_delta(X)
        else:
            return self.c_eta(X)

But using this method I get a lot of divergences and very slow convergence in the sampling even at target acceptance of 0.95. Additionally, I run into issues with the prediction, namely the “input dimension mismatch issue” where the prediction step appears to want my predictive values to be the same shape as my input values. (link to someone experiencing the same issue: "Input dimension mis-match" in basic model?). When I pass a 1D vector for y instead of a nx1 vector, the Cholesky decomposition fails.

So my question is this: Is there something in my covariance function that I am doing incorrectly in the covariance function implementation that pymc3 is not happy with, causing the slow sampling and divergence? Is there something wrong with the covariance function implementation that is causing the failure in the Cholesky decomposition?