Using a GP to separate the contributing effects to height data

I have a data set of (x, y, z) height data. By eye, the height values follow a periodic pattern on top of a smoothly varying pattern.

I am using pymc 5.0.1

I want to separate the periodic pattern from the smooth pattern, and I am hoping a Gaussian process can help me do this.
From domain knowledge, I know that one physical effect causes the smoothly varying portion of the data and a different effect causes the periodic portion. In other words, I am assuming z(x, y) = f_smooth(x, y) + f_periodic(x, y) + noise, and I want to look at the two components separately.

I have constructed a toy dataset to help illustrate what I am trying to do in my work.

import pymc as pm
import numpy as np
import arviz as az
import matplotlib as mpl
import matplotlib.pyplot as plt

# toy dataset setup
npoints = 50
bound = 10

xcoords = np.linspace(-bound, bound, npoints)
ycoords = np.linspace(-bound, bound, npoints)

xmesh, ymesh = np.meshgrid(xcoords, ycoords)

# height data is a quadratic portion added to a periodic portion
zmesh = (xmesh**2 + ymesh**2)/bound + np.cos(2*np.pi*xmesh / (bound/5))
noise = np.random.random(zmesh.shape) * (.1 * bound)  # uniform noise in [0, 0.1*bound)

z = (zmesh + noise).flatten()

xy = np.c_[xmesh.flatten(), ymesh.flatten()]  # (npoints**2, 2) input array
z = (z - z.mean()) / z.std()                  # standardize the heights

The data looks like this (plot to come; see the note at the end about uploads):
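In the meantime, here is a quick matplotlib snippet (using only the arrays defined above) that renders the toy surface the way I see it:

fig, ax = plt.subplots()
im = ax.imshow(
    zmesh + noise,
    extent=[-bound, bound, -bound, bound],
    origin="lower",
    cmap="rainbow",
)
fig.colorbar(im, ax=ax, label="z")
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()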

I have been going over the examples in https://github.com/pymc-devs/pymc-examples/tree/main/examples/gaussian_processes and this is what I came up with for modelling:

with pm.Model() as model:
    # smooth (primary) effect: 2-D Matern with one lengthscale per input dimension
    ls = pm.Gamma('ls', alpha=2, beta=1, shape=2)
    eta = pm.HalfCauchy('eta', beta=5)

    cov_primary = eta**2 * pm.gp.cov.Matern52(2, ls=ls)
    # gp_primary = pm.gp.Marginal(cov_func=cov_primary)

    # periodic effect
    ls_periodic = pm.Gamma('ls_periodic', alpha=2, beta=1, shape=2)
    tau_periodic = pm.Gamma('tau_periodic', alpha=2, beta=1, shape=2)
    eta_periodic = pm.HalfCauchy('eta_periodic', beta=5)

    cov_periodic = eta_periodic**2 * pm.gp.cov.Periodic(2, period=tau_periodic, ls=ls_periodic)
    # gp_periodic = pm.gp.Marginal(cov_func=cov_periodic)

    # overall: one GP whose covariance is the sum of the two components
    # gp = gp_primary + gp_periodic
    gp = pm.gp.Marginal(cov_func=cov_primary + cov_periodic)

    sigma = pm.HalfCauchy("sigma", beta=5)
    z_ = gp.marginal_likelihood('z', X=xy, y=z, sigma=sigma)

with model:
    mp = pm.find_MAP(method='BFGS')

I use pm.find_MAP() because full sampling takes far too long.
I then check how well the fit did by expanding the xy window and calling gp.predict():

xcoords_new = np.linspace(-2*bound, 2*bound, 4*npoints)
ycoords_new = np.linspace(-2*bound, 2*bound, 4*npoints)

Xnew = pm.math.cartesian(xcoords_new[:, None], ycoords_new[:, None])

with model:
    mu, var = gp.predict(Xnew, point=mp, diag=True)

cmap = 'rainbow'
norm = mpl.colors.Normalize(vmin=z.min())
plt.imshow(mu.reshape(4*npoints, 4*npoints).T, norm=norm, cmap=cmap)

The resulting image doesn’t look too bad compared to the true data.
My question is:
How do I plot just the contribution from the periodic portion?
I tried:

with model:
    # this draws one sample from the *prior* of the periodic kernel
    # (zero mean, cov_periodic at the training inputs), not from the posterior
    a = np.random.multivariate_normal(pm.gp.mean.Zero()(xy).eval(), cov_periodic(xy).eval())
plt.imshow(a.reshape(npoints, npoints), cmap='rainbow')

But that does not look like a cosine wave that depends only on x. Is this the correct way to extract the periodic portion? I am unsure whether the result comes from a poor model or from extracting the component incorrectly; my best guess at a conditioned extraction is sketched below.
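For reference, my reading of the additive-GP examples (e.g. the Mauna Loa notebook) is that the two components should stay separate pm.gp.Marginal objects, added together as GPs, and that the periodic part is then conditioned on the data by passing a given dictionary to predict. I haven't verified this end to end, and I'm assuming the given keys in pymc 5 are X, y, sigma, and gp (older examples use noise instead of sigma), so treat this as a sketch; it reuses xy, z, Xnew, and npoints from above:

with pm.Model() as additive_model:
    # same priors as the model above
    ls = pm.Gamma('ls', alpha=2, beta=1, shape=2)
    eta = pm.HalfCauchy('eta', beta=5)
    cov_primary = eta**2 * pm.gp.cov.Matern52(2, ls=ls)

    ls_periodic = pm.Gamma('ls_periodic', alpha=2, beta=1, shape=2)
    tau_periodic = pm.Gamma('tau_periodic', alpha=2, beta=1, shape=2)
    eta_periodic = pm.HalfCauchy('eta_periodic', beta=5)
    cov_periodic = eta_periodic**2 * pm.gp.cov.Periodic(2, period=tau_periodic, ls=ls_periodic)

    # keep separate GP objects and add the GPs, not just the covariances
    gp_primary = pm.gp.Marginal(cov_func=cov_primary)
    gp_periodic = pm.gp.Marginal(cov_func=cov_periodic)
    gp = gp_primary + gp_periodic

    sigma = pm.HalfCauchy('sigma', beta=5)
    z_ = gp.marginal_likelihood('z', X=xy, y=z, sigma=sigma)
    mp = pm.find_MAP(method='BFGS')

    # condition the periodic component on the data that the *sum* was fit to
    given = {'X': xy, 'y': z, 'sigma': sigma, 'gp': gp}
    mu_periodic, _ = gp_periodic.predict(Xnew, point=mp, given=given, diag=True)

plt.imshow(mu_periodic.reshape(4*npoints, 4*npoints).T, cmap='rainbow')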

Second question:
Is this an appropriate application of Gaussian processes?
I realize I may be using the wrong tool for the job. My actual task will have a few orders of magnitude more data points, and pymc already took a while on just a 30x30 grid, so I am worried I won't be able to scale this up to my actual problem; one sparse option I've found is sketched below.

I will try to include plots of the data and output later. My work firewall is blocking uploads.

I hope I understood your problem correctly. From what I can gather, you're trying to sample a marginal GP over all of the expanded coordinates (assuming the period repeats over the 50 coordinates 50 times), which explains the slow sampling. If you want to infer the influence of the factors on the z variable, you may want to use a latent GP instead. I have drawn a lot of inspiration for this type of problem from this post: Modeling spatial data with Gaussian processes in PyMC (PyMC Labs). So, if I understood correctly, maybe this revised version of your model could help:

import pymc as pm
import numpy as np
import arviz as az
import matplotlib as mpl
import matplotlib.pyplot as plt

# toy dataset setup

np.random.seed(27)

npoints = 50
bound = 10

xcoords = np.linspace(-bound, bound, npoints)
ycoords = np.linspace(-bound, bound, npoints)

xmesh, ymesh = np.meshgrid(xcoords, ycoords)

# height data is a quadratic portion added to a periodic portion
zmesh = (xmesh**2 + ymesh**2)/bound + np.cos(2*np.pi*xmesh / (bound/5))
noise = np.random.random(zmesh.shape)*(.1*bound)

z = (zmesh+noise).flatten()

xy = np.c_[xmesh.flatten(), ymesh.flatten()]
z = (z - z.mean()) / z.std()

period_labels = np.array(["p" + str(p) for p in range(len(xcoords))])
# each consecutive block of flattened points shares one period label;
# integer division keeps the repeat count an int
period_idx = np.repeat(np.arange(len(xcoords)), xy.shape[0] // len(xcoords))
samples_idx = np.arange(xy.shape[0])

coords = {"samples": samples_idx, "period": period_labels, "feature": ["xcoor", "ycoor"]}

X = np.array([xcoords, ycoords]).T  # one (x, y) pair per period group

with pm.Model(coords=coords) as model:
    t_idx = pm.ConstantData("period_idx", period_idx, dims="samples")
    Xs = pm.ConstantData("Xs", X, dims=("period", "feature"))

    # latent (non-marginalized) GP over the period-level inputs
    l = pm.Gamma("l", mu=100, sigma=50)
    k = pm.gp.cov.Matern52(input_dim=2, ls=l)
    latent_s = pm.gp.Latent(cov_func=k)
    gamma_z = latent_s.prior("gamma_z", Xs, dims="period")
    gamma_l = pm.Normal("gamma_l", 0, 1)
    gamma_s = pm.HalfNormal("gamma_s", 1)
    gamma = pm.Deterministic("gamma", gamma_l + gamma_s*gamma_z, dims="period")  # spatial GP

    epsilon = pm.HalfNormal("epsilon", 1)

    y = pm.Normal("y", mu=gamma[t_idx], sigma=epsilon, observed=z, dims="samples")
    
with model:
    idata = pm.sample(1000, tune=1000, cores=4, target_accept=0.9, random_seed=27)

summ = az.summary(idata, var_names=['l', "gamma_l", "epsilon"])
print(summ)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [l, gamma_z_rotated_, gamma_l, gamma_s, epsilon]
 |████████████| 100.00% [8000/8000 16:41<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1023 seconds.

           mean     sd  hdi_3%  hdi_97%  ...  mcse_sd  ess_bulk  ess_tail  r_hat
l        23.355  5.580  13.520   34.120  ...    0.084    2209.0    2643.0    1.0
gamma_l   0.722  0.879  -0.873    2.448  ...    0.012    4077.0    2392.0    1.0
epsilon   0.723  0.010   0.705    0.744  ...    0.000    6613.0    2555.0    1.0

[3 rows x 9 columns]

Maybe you could expand this to some sort of LGCP (log-Gaussian Cox process) model, or to something else that lets you separate the contributions of temporal and spatial background processes; a quick way to inspect the inferred spatial effect is shown below. I hope it's useful.
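For example, the inferred spatial effect is just gamma in the posterior, so something like this should plot its posterior mean per period group:

# posterior mean of the latent spatial effect, one value per "period" group
gamma_mean = idata.posterior["gamma"].mean(dim=("chain", "draw")).values

plt.plot(gamma_mean)
plt.xlabel("period group")
plt.ylabel("posterior mean of gamma")
plt.show()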
