Coregion GP Very Slow Inference

I’m trying my hand at multiclass classification with GPs. I already have a basic model that works, where each GP is independent of the rest and I softmax their outputs to get the probabilities. I’m now trying to build an ICM model, but PyMC’s computation time has gone up to an unreasonable extent: the estimated time for MCMC is around 500+ hours on a relatively good machine, SVGD doesn’t even initialize, and even ADVI takes dozens of hours. Is there any way to expedite the computations here? I don’t see why it’s so slow, save for the unusual input shape required by the Coregion kernel (I’ve had to duplicate all rows K times, for K categories, to massage it into the correct format). The following non-ICM version works acceptably well:

# MGP without coregionalization
with pymc.Model() as categorical_simple_model:
    factors = 5
    N, M = X_train.shape
    gps = []
    fs = []
    G = 1e7
    σ_w = pymc.Normal(f'σ_w', mu=10., sigma=3, shape=M)
    σ_b = pymc.Normal(f'σ_b', mu=0.0, sigma=0.1)
    for factor, loc in core_mappings[('origin', 'location', 'loc')].items():
        μ = pymc.gp.mean.Constant(0)
        # μ_g = pymc.Normal(f'μ_g_{loc}', mu=10.0, sigma=1.5)
        # σ_g = pymc.Normal(f'σ_g_{loc}', mu=2.5, sigma=0.5)
        μ_g = 10.0
        σ_g = 1.0
        η = pymc.Normal(f'η_{loc}', mu=μ_g, sigma=σ_g)
        κ = MultiLayerPerceptronKernel(M, variance=η**2, bias_variance=σ_b**2, weight_variance=σ_w**2 + 1e-4)
        gp = pymc.gp.Latent(mean_func=μ, cov_func=κ)
        gps.append(gp)
        f = gp.prior(f'f_{loc}', X=x_tensor)
        fs.append(f)
    f = pymc.Deterministic('f', at.stack(*fs).T)
    p = pymc.Deterministic('p', at.nnet.softmax(f, axis=1))
    # y = pymc.Multinomial('y', p=p ,observed=Y_train.values, n=1)
    y = pymc.Categorical('y', p=p, observed=y_tensor)

The ICM version takes hundreds of hours to run:

with pymc.Model() as categorical_ICM:
    LOWER_BOUND_NON_NEGATIVE = 1e-4
    σ_n = pymc.TruncatedNormal('σ_n', mu=10.0, sigma=2, lower=LOWER_BOUND_NON_NEGATIVE, shape=x.shape[0], initval=np.ones(x.shape[0]))
    k = pymc.TruncatedNormal('k', mu=10.0, sigma=2, lower=LOWER_BOUND_NON_NEGATIVE, shape=factors)
    W = (-1)*pymc.TruncatedNormal('W', mu=10.0, sigma=2.0, lower=LOWER_BOUND_NON_NEGATIVE, shape=(factors, 2))
    σ_b = pymc.TruncatedNormal('σ_b', mu=3.0, sigma=1.0, lower=LOWER_BOUND_NON_NEGATIVE)
    σ_w = pymc.TruncatedNormal('σ_w', mu=3.0, sigma=0.8, lower=LOWER_BOUND_NON_NEGATIVE, shape=x.shape[1])
    G = 1
    η = pymc.TruncatedNormal('η', mu=10, sigma=3.0, lower=LOWER_BOUND_NON_NEGATIVE)
    K = MultiLayerPerceptronKernel(x.shape[1], active_dims=[i for i in range(1, x.shape[1])],
                                   variance=G*(η+1.0), bias_variance=σ_b**2, weight_variance=σ_w**2)
    B = pymc.gp.cov.Coregion(X.shape[1], kappa=k, W=W, active_dims=[0])
    k_noise = pymc.gp.cov.WhiteNoise(σ_n)
    κ = (K + k_noise)*B
    μ = pymc.gp.mean.Constant(c=0)
    gp = pymc.gp.Latent(mean_func=μ, cov_func=κ)
    _f = gp.prior('_f', X=x_tensor)
    f = pymc.Deterministic('f', _f.reshape((-1, factors)))
    p = pymc.Deterministic('p', at.nnet.softmax(f, axis=1))
    y_obs = pymc.Categorical('y_obs', p=p, observed=y_tensor)

I think I have a rough idea of what your MLP kernel is doing, but would it be possible to supply an end-to-end example here to help me understand this?

Not sure what you mean by an end-to-end example, but here’s some more info. As I mentioned, this is a multiclass classification problem where I softmax the MGP output to turn it into a valid probability distribution. The MLP kernel was chosen rather accidentally: I tried many other kernels and combinations thereof, none of which worked (accuracy below 50%). I picked it because it’s available in GPy, and after trying it more or less at random, it was the only one that worked.
This worked quite well with a scalar weight variance and a bias variance of approximately 0, but I’m now experimenting with vector weights (one per input dimension), as this will hopefully make the model more interpretable.

I’ve tried running this model without the Coregion kernel, which worked well but required setting the output variance to extremely high values, the highest machine precision would allow (bearing in mind the exponentials involved). If you’re asking about the actual implementation of the kernel, it is identical to the one in GPy, with all the relevant bits swapped out for Aesara ops.

I just meant something that I can copy/paste into a Colab or similar environment and run so that I can see the types of data that are being fed in, the size of the data, etc. It looks like there’s a fair amount of complexity here between the MLP kernel and the coregionalization. It’s a very interesting use case and I’d like to have a runnable example to take a closer look.

In general, using a loop to construct independent GPs is a pretty inefficient process; you may want to use the Kronecker GP construction to instantiate one very large sparse GP and then perform operations on that vector output to help speed things up.
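Roughly along these lines (a sketch only, with illustrative names X_train, y_train, N, M, K, and an ExpQuad stand-in for your MLP kernel; I haven’t run this on your data):

import numpy as np
import pymc
import aesara.tensor as at

with pymc.Model() as kron_sketch:
    # Kernel over the (N, M) feature matrix
    η = pymc.HalfNormal('η', sigma=2.0)
    ℓ = pymc.HalfNormal('ℓ', sigma=2.0)
    k_x = η**2 * pymc.gp.cov.ExpQuad(M, ls=ℓ)
    # Coregion kernel over the K category indices
    W = pymc.Normal('W', mu=0.0, sigma=1.0, shape=(K, 2))
    κ_diag = pymc.HalfNormal('κ_diag', sigma=1.0, shape=K)
    k_task = pymc.gp.cov.Coregion(1, W=W, kappa=κ_diag)
    # One Kronecker-structured latent GP instead of a loop of independent GPs
    gp = pymc.gp.LatentKron(cov_funcs=[k_x, k_task])
    f = gp.prior('f', Xs=[X_train, np.arange(K)[:, None]])   # length N*K, category index varies fastest
    p = pymc.Deterministic('p', at.nnet.softmax(f.reshape((N, K)), axis=1))
    y = pymc.Categorical('y', p=p, observed=y_train)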


Input X is a 500 x 63 matrix (all real, continuous values); Y is a categorical vector. To satisfy the kernel’s requirements for isotopic data, I’ve essentially had to np.tile the matrix, making it 5 times as large. The MLP code is adapted from this implementation. The loop is, ironically, almost 100x more efficient than the coreg model. I’ve also implemented it via the Kronecker kernel, with weirder results: while sampling is just as slow, ADVI reaches almost 10x lower loss.
P.S. The Categorical likelihood seems to work only if Y is a vector.
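Concretely, the massaging amounts to something like this (a sketch; the exact interleaving is my assumption, chosen to match the _f.reshape((-1, factors)) step in the ICM model above):

import numpy as np

K = factors                                        # 5 categories
N = X_train.shape[0]                               # 500 rows
task_idx = np.tile(np.arange(K), N)[:, None]       # category index, cycling fastest
x = np.hstack([task_idx, np.repeat(np.asarray(X_train), K, axis=0)])   # shape (N*K, 1 + 63)
# column 0 feeds the Coregion kernel (active_dims=[0]); columns 1..63 feed the MLP kernel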

I would look at sampling this on a GPU with sample_numpyro_nuts or via nutpie on the CPU. They are generally much faster than pm.sample.
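For example (a sketch; assumes PyMC v4 with JAX and numpyro installed, and the argument values are just placeholders):

from pymc.sampling_jax import sample_numpyro_nuts

with categorical_ICM:
    idata = sample_numpyro_nuts(draws=1000, tune=1000, chains=4, target_accept=0.9)
    # nutpie (CPU) is a separate package with its own compile/sample entry points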

Also, what does your MultiLayerPerceptronKernel look like, and what happens if you swap it out for a built-in kernel?

Since multiple people asked, this is the exact implementation of the MLP kernel, ported from the GPy source. I think it would be interesting to look into implementing the Arc and DNN family of kernels in PyMC, since they seem to be very effective in highly non-linear cases. I’ll see if I can figure out how to use the numpyro sampler as suggested, and try another kernel. I cannot share the dataset, as it is part of ongoing chemical research; it’s about 500x63, all standardized and continuous.


class MultiLayerPerceptronKernel(pymc.gp.cov.Covariance):
    '''
        Multi-layer Perceptron Kernel, a.k.a. the Arcsine Kernel.
        Code transplanted from `GPy.kern.src.mlp`. This kernel
        is designed to mimic the highly non-linear behaviour
        of a neural network and can be thought of as the limit
        of an A.N.N. with a single hidden layer as the number
        of neurons in that layer goes to infinity. The N.N.,
        with normal priors over its weights and biases, then
        becomes equivalent to a G.P. This kernel is given by:
        
        .. math::

          k(x,y) = \\sigma^{2}\\frac{2}{\\pi }  \\text{asin} \\left
          ( \\frac{ \\sigma_w^2 x^\\top y+\\sigma_b^2}{\\sqrt{\\sigma_w^2x^\\top x +
          \\sigma_b^2 + 1}\\sqrt{\\sigma_w^2 y^\\top y + \\sigma_b^2 +1}} \\right )
          
        Args:
        -----

            - input_dims:int=1 := The number of input dimensions (columns) used for covariance
            computations. Defaults to 1 (use the first column only).

            - active_dims:Optional[list[int]]=None := None or array-like specifying which input
            dimensions will be used in computing the covariance matrix. Can be specified as
            a boolean vector or a vector of indices. Optional. Defaults to None, in which case
            the first `input_dims` columns are selected.

            - variance:Optional[float]=1.0 := The output kernel variance scaling parameter, usually
            denoted `\\eta`. Defaults to 1.0. Must be positive.

            - weight_variance:Optional[Union[float, np.typing.NDArray]]=1.0 := The variance of
            the weights in the A.N.N. Can be specified as either a scalar or an array of
            appropriate size (broadcastable against the input matrix). Optional. Defaults to 1.0.

            - bias_variance:Optional[Union[float, np.typing.NDArray]]=1.0 := The variance of
            the biases in the A.N.N. Can be either a scalar or an array of appropriate size. When
            computing a covariance matrix (single input), a non-scalar value must be an N-length
            vector for an `N \\times M` input matrix; for cross-covariance matrices it must be an
            `N \\times N` matrix. Optional. Defaults to 1.0.

            - ARD:bool=False := A(utomatic) R(elevance) D(etermination) flag. Unused; raises a
            NotImplementedError if set to True. Optional. Defaults to False.
        
        
        Returns:
        -------
        
            - K:np.typing.NDArray := When called with the `.full(X)` or the `.full(X, X2)` method,
            computes and returns the covariance or cross-covariance matrix.

            - diag(K):np.typing.NDArray := When called with the `.diag(X)` method, computes
            and returns the diagonal of the covariance matrix.
    '''
    four_over_tau = 2./np.pi
    
    def __init__(self,input_dims:int, active_dims:typing.Union[list[int], int]=None,
                 variance:float=1.0,
                 weight_variance:typing.Union[float, np.typing.NDArray]=1.0,
                 bias_variance:typing.Union[float, np.typing.NDArray]=1.0, ARD:bool=False):
        super(MultiLayerPerceptronKernel, self).__init__(input_dims, active_dims=active_dims)
        self.variance = variance
        self.weight_variance = weight_variance
        self.bias_variance = bias_variance
        if ARD:
            raise NotImplementedError("Automatic Relevance Determination, not implemented")
        self.ARD = ARD

    
    def _comp_prod(self, X, X2=None):
        # Weighted inner products: σ_w²·diag(X Xᵀ) + σ_b² (vector) or σ_w²·X X2ᵀ + σ_b² (matrix).
        if X2 is None:
            return (at.math.square(X)*self.weight_variance).sum(axis=1) + self.bias_variance
        else:
            return (X*self.weight_variance).dot(X2.T) + self.bias_variance
    
    def diag(self, X):
        """Compute the diagonal of the covariance matrix for X."""
        X_prod = self._comp_prod(X)
        return self.variance*MultiLayerPerceptronKernel.four_over_tau*at.math.arcsin(X_prod/(X_prod+1.))
    
    def full(self, X, X2=None):
        """Compute the full covariance matrix of X, or the cross-covariance of X and X2."""
        if X2 is None:
            X_denom = at.math.sqrt(self._comp_prod(X) + 1.)
            X2_denom = X_denom
            X2 = X
        else:
            X_denom = at.math.sqrt(self._comp_prod(X) + 1.)
            X2_denom = at.math.sqrt(self._comp_prod(X2) + 1.)
        XTX = self._comp_prod(X, X2)/X_denom[:, None]/X2_denom[None, :]
        return self.variance*MultiLayerPerceptronKernel.four_over_tau*at.math.arcsin(XTX)
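
For reference, a quick shape check of the kernel on made-up data (X_demo is synthetic, not the real dataset):

import numpy as np
import aesara.tensor as at

X_demo = at.as_tensor_variable(np.random.default_rng(0).normal(size=(8, 3)))
kern = MultiLayerPerceptronKernel(3, variance=1.0,
                                  weight_variance=np.ones(3), bias_variance=0.1)
K_full = kern.full(X_demo).eval()   # (8, 8) covariance matrix
K_diag = kern.diag(X_demo).eval()   # (8,) matches np.diag(K_full) up to float error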

UPDATE: Swapping out the MLP kernel for the RBF kernel (which also removes around 3 RVs) has had no effect on sampling time. MCMC is so slow that it takes over 10 hours to finish initializing and draw a single sample.
It took me forever to figure out how to set up CUDA and to realize that you have to explicitly import sample_numpyro_nuts, but the speedup is dramatic.