Gaussian Processes: Compositional Data

Hello,

Wanted to reach out and see if any community members have attempted to apply GPs to compositional data of the form N×d, where d is the dimension of the counts and N is the number of observations. Here N_i would be the count of observations at time t_i.

To me, this would most accurately be described as a hierarchical model where we place a Dirichlet prior on a multinomial response:
y_i ~ Mult(M_i, p_i)
where y_i is the corresponding row vector in the N×d data matrix, and M_i and p_i are the corresponding total count and proportion vector.

I am under the impression that I should use a GP to model the mean of the Dirichlet vector at a given time t. So I would have

mu_alpha ~ GP(0, K)

where mu_alpha is the mean of my Dirichlet vector and K is my spatio-temporal kernel (I guess for the sake of understanding we can assume the space and time kernels are separable).

This paper seems to outline a similar procedure; however, pages 6-7, which contain some relevant mathematical details, are a little ambiguous as to what is happening (for instance, the vector mu in the link function seems to come out of nowhere): https://arxiv.org/pdf/1903.05036. I do not have a geostats background, so perhaps there is also some terminology I’m missing.

In terms of trying to get on the right track: is my idea correct? Is it practical, given that we may not be able to use some hierarchical Gaussian process approximations? Is there perhaps a more direct way of going about this?

Thanks!

I fed your question into our PyMC-bot; here’s its answer:

Applying Gaussian Processes to Compositional Data

Based on your question about applying Gaussian Processes (GPs) to compositional data, I can provide some guidance on your approach and implementation considerations in PyMC.

Your Proposed Approach

Your idea of using a hierarchical model with a Dirichlet prior on a multinomial response is a reasonable approach for compositional data. The structure you’ve outlined:

y_i ~ Mult(M_i, p_i)

where you want to use a GP to model the mean of the Dirichlet vector at a given time t:

mu_alpha ~ GP(0, K)

This approach makes sense conceptually for modeling compositional time series data [1].
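
Spelled out, the full hierarchy would look like this (the exponential link in the third line is one common choice, discussed further below, not something fixed by the model itself):

y_i ~ Mult(M_i, p_i)
p_i ~ Dirichlet(alpha(t_i))
alpha_j(t) = exp(f_j(t)), for j = 1, ..., d
f_j ~ GP(0, K)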

Implementation Considerations

1. Hierarchical Structure

Your hierarchical structure is similar to what’s shown in Document 1, where you’d use a GP to model the parameters of a Dirichlet distribution. This is a valid approach for compositional data that preserves the simplex constraint (proportions sum to 1) [1].

2. GP Approximation Methods

For practical implementation, you might want to consider using the Hilbert Space Gaussian Process (HSGP) approximation mentioned in Document 2. This would be particularly useful if:

  • Your dataset is large (many time points)
  • You need faster computation than a full GP would allow
  • Your input dimension is low (1D for time or 2D for space-time) [2]

The HSGP approximation reduces computational complexity from O(n³) to O(mn + m), where n is the number of data points and m is the number of basis vectors [2].
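
A minimal sketch of the HSGP version for a 1-D time input (the values of m, the number of basis functions, and c, the boundary factor, are placeholders you would tune to your data):

import numpy as np
import pymc as pm

X = np.linspace(0, 10, 50)[:, None]  # toy (N, 1) array of time points

with pm.Model() as hsgp_model:
    # Same hyperparameters and kernel as a full GP
    length_scale = pm.InverseGamma("length_scale", alpha=2.0, beta=1.0)
    eta = pm.HalfNormal("eta", sigma=1.0)
    cov_func = eta**2 * pm.gp.cov.Matern52(1, ls=length_scale)

    # HSGP approximation: m basis functions per input dimension,
    # boundary factor c controls how far the approximation extends
    gp = pm.gp.HSGP(m=[25], c=1.5, cov_func=cov_func)
    f = gp.prior("f", X=X)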

3. Link Function

For the link function between the GP and the Dirichlet parameters, you’ll need to ensure the parameters remain positive. A common approach is to use an exponential or softmax transformation of the GP outputs to maintain the constraints of the Dirichlet distribution [1].
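
As a quick numeric illustration of both options (made-up latent values, 4 time points by 3 components):

import numpy as np

# Hypothetical latent GP draws: 4 time points, 3 components
f = np.random.default_rng(0).normal(size=(4, 3))

# exp keeps Dirichlet concentration parameters strictly positive
alpha = np.exp(f)

# softmax maps each row onto the simplex (each row of p sums to 1)
p = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)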

4. Multiple GPs for Multiple Dimensions

Since your compositional data has d dimensions, you could model this using multiple GPs with a shared covariance structure. Document 3 mentions that you can have “multiple GPs in the same problem” and use ARD (Automatic Relevance Determination) for multidimensional inputs [3].
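
As a small sketch of the ARD idea (assuming a 3-column input of [t, x, y]; that column layout is hypothetical), you give the kernel one lengthscale per input dimension:

import pymc as pm

with pm.Model() as ard_model:
    # One lengthscale per input column; columns whose learned
    # lengthscale is large contribute little to the covariance
    ls = pm.InverseGamma("ls", alpha=2.0, beta=1.0, shape=3)
    cov_func = pm.gp.cov.Matern52(input_dim=3, ls=ls)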

Implementation in PyMC

Here’s a sketch of how you might implement this in PyMC:

import pymc as pm
import numpy as np

# Assuming X is your (N, 1) array of time points and Y is your (N, d)
# matrix of counts; the total count per observation is the row sum
N, d = Y.shape
M = Y.sum(axis=1)

with pm.Model() as model:
    # GP hyperparameters
    length_scale = pm.InverseGamma("length_scale", alpha=2.0, beta=1.0)
    eta = pm.HalfNormal("eta", sigma=1.0)

    # Define covariance function over the 1-D time input
    cov_func = eta**2 * pm.gp.cov.Matern52(1, ls=length_scale)

    # One latent GP per component of the composition
    fs = []
    for i in range(d):
        gp = pm.gp.Latent(cov_func=cov_func)
        fs.append(gp.prior(f"f_{i}", X=X))

    # Stack into an (N, d) matrix and exponentiate so the Dirichlet
    # concentration parameters stay positive
    f = pm.math.stack(fs, axis=1)
    alpha = pm.Deterministic("alpha", pm.math.exp(f))

    # Dirichlet-Multinomial likelihood: the Dirichlet prior on p_i is
    # marginalized out, and the call is vectorized over all N observations
    pm.DirichletMultinomial("y", n=M, a=alpha, observed=Y)
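
Once the model is defined, sampling is the standard call (a minimal sketch; draws, tuning, and target_accept will need adjusting to your data):

with model:
    idata = pm.sample(1000, tune=1000, target_accept=0.9)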

Practical Considerations

  1. Separability: Your assumption about separable space and time kernels is valid and commonly used. This can be implemented by taking the product of a spatial kernel and a temporal kernel (see the sketch after this list) [2].

  2. Approximation Methods: If computational efficiency is a concern, the HSGP approach from Document 2 would be beneficial. For very large datasets, you might also consider variational inference methods [2].

  3. Data Containers: Be aware that there have been issues using Data Containers with GPs in older versions of PyMC, as mentioned in Document 4. However, this may be resolved in newer versions [4].
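
Expanding on point 1, here is a minimal sketch of a separable space-time kernel, assuming the input matrix has three columns [t, x, y] (that layout is an assumption for illustration):

import pymc as pm

# Temporal kernel acts on column 0, spatial kernel on columns 1-2;
# active_dims selects which input columns each kernel sees, and the
# product of the two covariance functions gives the separable kernel
cov_time = pm.gp.cov.Matern52(input_dim=3, ls=1.0, active_dims=[0])
cov_space = pm.gp.cov.Matern52(input_dim=3, ls=[1.0, 1.0], active_dims=[1, 2])
cov_func = cov_time * cov_space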

Conclusion

Your approach is theoretically sound. The main challenges will be in the practical implementation, particularly in ensuring computational efficiency and proper transformation of the GP outputs to maintain the constraints of the Dirichlet distribution.

For implementation, I would recommend starting with a simplified version (perhaps with synthetic data) to ensure the model structure works as expected before scaling to your full dataset.

Thank you for your reply, super helpful so far! In particular, some of the code here is really helpful for setting this up from a programming perspective, which is sometimes the hardest part!

From a heuristics perspective, I guess the biggest gap in my understanding is how to motivate the mean vector of the Dirichlet vector. Obviously I could just look at ‘training points’ and pick a mean vector reasonably close to the observed values, define a Kronecker product for the covariance, and then play with different lengthscales along time and space while evaluating the log likelihood of the model, but I’m not sure if there is a more principled way of doing so.

I’m also a bit unsure how to implement this using HSGPs, as the mean of the Dirichlet vector is not zero/stationary (the compositional data has a trend and some periodicity to it); to my understanding this disqualifies an HSGP approach.