# Gaussian Processes and Unusual Noise

I am trying to model a process that looks like a smooth function with heavy-tailed noise, where some values are additionally multiplied by a random value between 0 and 1. Consider the following example data.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 50)
y = 5 + np.sin(x)
noise = 0.2 * rng.standard_t(df=1.5, size=x.shape)
# multiply ~20% of the values by a random number between 0 and 1
eff = np.minimum(5 * rng.random(size=x.shape), 1)
obs = (y + noise) * eff
```
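As a sanity check on the generating mechanism (my own verification, not part of the original question): with `eff = min(5u, 1)` for `u ~ Uniform(0, 1)`, a value is scaled down exactly when `u < 0.2`, so roughly 20% of observations should have `eff < 1`:

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.random(size=100_000)
eff = np.minimum(5 * u, 1)

# Fraction of observations that get scaled down; should be close to 0.2
frac_scaled = np.mean(eff < 1)
```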

I can leverage `gp.Latent` and `StudentT` to capture the underlying function quite well, as follows:

```python
import pymc as pm

with pm.Model() as model:
    # Underlying GP
    ls = pm.Gamma("ls", mu=3, sigma=1)
    scale = pm.Gamma("scale", mu=1, sigma=0.5)
    cov = scale**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ls)
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior("f", x[:, None])

    # Noise
    offset = pm.Normal("offset", mu=5, sigma=2)
    nu = 1 + pm.Gamma("noise_nu", mu=1, sigma=1)
    sigma = pm.HalfNormal("noise_sigma", sigma=1)
    pm.StudentT("noisy_f", nu=nu, mu=offset + f, sigma=sigma, observed=obs)

    # Sample
    trace = pm.sample()

x_new = np.linspace(-4, 4, 100)
with model:
    f_pred = gp.conditional("f_pred", x_new[:, None])
    pm.StudentT("noisy_f_pred", nu=nu, mu=offset + f_pred, sigma=sigma)
    ppc = pm.sample_posterior_predictive(trace, var_names=["f_pred", "noisy_f_pred"])
```

However, though this approach captures the underlying function well, it overestimates the noise variance by quite a bit and obviously doesn’t capture the random drops in “efficiency” (i.e. the multiplications by values between 0 and 1).

How can I model this process more accurately?

I’ve tried a couple of things but they always lead to strange or completely wrong results. I’m interested in creating a generative model so I can sample other similar-looking data.

SIDE-NOTE: I am loving the new `HSGP` class and the ability to leverage `prior_linearized` to treat GPs like a regular linear model and get rapid inference…but didn’t want to complicate my example above by including it.


For example, I tried adding this after the definition of `f`, which does seem to help for the example above: the estimated noise variance comes out a little lower, as expected.

```python
eff = pm.Uniform("efficiency", lower=0, upper=5, size=x.shape)
f *= pm.math.clip(eff, 0, 1)
```

For completeness, here is my `HSGP` implementation, which I am struggling to make converge.

```python
with pm.Model() as model:
    # Mutable input data
    X_data = pm.MutableData("X", x[:, None])
    obs_data = pm.MutableData("y", obs)

    # Underlying GP
    ls = pm.Gamma("ls", mu=3, sigma=1)
    scale = pm.Gamma("scale", mu=1, sigma=0.5)
    cov = scale**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ls)

    gp = pm.gp.HSGP(m=[30], L=[5.0], cov_func=cov)  # example basis size / boundary
    phi, sqrt_psd = gp.prior_linearized(X_data)  # inputs are already centred
    beta = pm.Normal("beta", size=gp._m_star)
    f = pm.Deterministic("f", phi @ (beta * sqrt_psd))

    # Efficiency (one value per observation, so shapes broadcast with f)
    eff = pm.Uniform("eff", lower=0, upper=5, size=X_data.shape[0])
    f *= pm.math.clip(eff, 0, 1)

    # Noise
    offset = pm.Normal("offset", mu=5, sigma=2)
    nu = 1 + pm.Gamma("noise_nu", mu=1, sigma=1)
    sigma = pm.HalfNormal("noise_sigma", sigma=1)
    pm.StudentT("noisy_f", nu=nu, mu=offset + f, sigma=sigma, observed=obs_data)

    # Sample
    trace = pm.sample(target_accept=0.98)

x_new = np.linspace(-4, 4, 100)
with model:
    pm.set_data({"X": x_new[:, None], "y": np.zeros(x_new.shape)})
    ppc = pm.sample_posterior_predictive(trace, var_names=["noisy_f"])
```
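On choosing `m` and `L`: a common heuristic from the HSGP literature is to centre the inputs, set the boundary `L` as a multiple `c` of the data half-range, and pick `m` large enough to resolve the smallest plausible lengthscale. A rough sketch (the constants `c = 1.5` and `m = 30` are assumptions, not tuned values):

```python
import numpy as np

x = np.linspace(-3, 3, 50)

# Centre the inputs, as prior_linearized expects
x_centered = x - x.mean()

# Boundary: c times the half-range of the centred inputs; c ~ 1.5-4 is typical
c = 1.5
S = np.max(np.abs(x_centered))
L = c * S

# Number of basis functions: m ~ 30-50 is usually plenty for a smooth 1D function
m = 30
```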

However, I am still interested to hear whether this is the right approach to the problem, and how I might get the `HSGP` implementation to converge.

The approach with `clip` is a neat idea, but unfortunately it makes the posterior, as seen by the sampler, quite tricky. If the `eff` parameter is larger than 1, the derivative will be zero, and the sampler can’t know that making it smaller will eventually change the logp. In general it is good to avoid any functions that don’t have a well-defined derivative everywhere in the parameter space, or that aren’t analytic.
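To see the flat-gradient problem concretely, here is a quick numerical check (plain NumPy, central finite differences) that the derivative of `clip` is exactly zero outside the bounds, so the sampler gets no directional information there:

```python
import numpy as np

def clip01(x):
    return np.clip(x, 0.0, 1.0)

def num_grad(f, x, h=1e-6):
    # Central finite difference approximation of the derivative
    return (f(x + h) - f(x - h)) / (2 * h)

g_inside = num_grad(clip01, 0.5)   # ~1: a usable slope inside the bounds
g_outside = num_grad(clip01, 2.0)  # 0: no signal about which way to move
```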

Effectively you have a discrete parameter for each of the observations, that tells us if that value needs to be scaled down. So you could code it along those lines:

```python
scale_down = pm.Bernoulli("scale_down", p=0.2, dims="observation")
eff = pm.Uniform("eff", lower=0, upper=1, dims="observation")
# Multiply by eff only where scale_down == 1, otherwise leave f unscaled
mu = offset + f * (scale_down * eff + (1 - scale_down))
```

This is still not a great solution, because now we have discrete parameters in the model, and we can’t use NUTS for those variables.
We can however do an extra step and marginalize out those variables. (I thought there was a tutorial about that somewhere in the docs, but I can’t find anything right now. Does anyone else know a good one?)

Something like this should do it (I didn’t test it, but hopefully you get the idea, if not please ask):

```python
pm.Mixture(
    "noisy_f",
    [0.2, 0.8],
    [
        pm.StudentT.dist(nu=nu, mu=offset + eff * f, sigma=sigma),
        pm.StudentT.dist(nu=nu, mu=offset + f, sigma=sigma),
    ],
    observed=obs,
)
```
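For intuition on what `Mixture` does with the discrete indicator: the Bernoulli is summed out, so the likelihood of each observation becomes a weighted sum of the two component densities, typically evaluated via `logsumexp` for numerical stability. A standalone sketch with SciPy, using fixed hypothetical parameter values in place of the model’s random variables:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import t as student_t

# Hypothetical fixed values standing in for the model's random variables
nu, offset, sigma = 3.0, 5.0, 0.3
f, eff = 0.8, 0.4
y = 4.6

# log p(y) = log( 0.2 * p(y | scaled) + 0.8 * p(y | full) )
logp_scaled = student_t.logpdf(y, df=nu, loc=offset + eff * f, scale=sigma)
logp_full = student_t.logpdf(y, df=nu, loc=offset + f, scale=sigma)
logp_mix = logsumexp([np.log(0.2) + logp_scaled, np.log(0.8) + logp_full])

# The same quantity computed naively, without the logsumexp trick
direct = np.log(0.2 * np.exp(logp_scaled) + 0.8 * np.exp(logp_full))
```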

There are definitely examples that people have posted here (e.g., this old example), but I’m not sure there is a clean, simple example in the notebook gallery/documentation.


https://www.pymc.io/projects/experimental/en/latest/generated/pymc_experimental.marginal_model.MarginalModel.html#pymc_experimental.marginal_model.MarginalModel

but in this case it’s equivalent to the mixture

Back to clipping: TensorFlow Probability has a differentiable clip operation that could be interesting here, `tfp.bijectors.SoftClip`.
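A dependency-free version of the same idea: a smooth, everywhere-differentiable stand-in for `clip(x, 0, 1)` built from softplus. This is my own construction, not tfp’s exact formula:

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x))
    return np.logaddexp(0.0, x)

def soft_clip01(x):
    # Smoothly approaches 0 for x << 0 and 1 for x >> 1,
    # with a strictly positive derivative everywhere
    return softplus(x) - softplus(x - 1.0)
```

The sharpness can be tuned by rescaling, e.g. `softplus(k * x) / k - softplus(k * (x - 1)) / k`, with larger `k` giving a closer approximation to the hard clip.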


This is very helpful. Thank you! Let me give it a go.

Ok, I tried this approach (including a prior on the efficiency drop ratio):

```python
    # Efficiency
    drop = pm.Uniform("drop", lower=0, upper=0.3)
    eff = pm.Uniform("eff", lower=0, upper=1, size=f.shape)

    # Likelihood
    pm.Mixture(
        "noisy_f",
        [drop, 1 - drop],
        [
            pm.StudentT.dist(nu=nu, mu=offset + f * eff, sigma=sigma),
            pm.StudentT.dist(nu=nu, mu=offset + f, sigma=sigma),
        ],
        observed=obs_data,
    )
```

And this seems to work well for the standard GP case. However, again, when I try it with `HSGP` the model very quickly diverges completely, with an r-hat of inf.

Do you have any suggestions how I could address this? Maybe narrowing the priors?

I will note that whilst I can’t get the `HSGP` implementation to work for this toy problem, the approach does seem to be working for the real problem I’m working on. So thanks a lot everyone! This was really instructive.
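A related trick that falls out of the mixture formulation: the posterior probability that any given observation was scaled down can be recovered with Bayes’ rule over the two components. A minimal sketch with fixed, hypothetical parameter values (in practice you would plug in draws from the trace):

```python
from scipy.stats import t as student_t

# Hypothetical point estimates (in practice, posterior draws from the trace)
offset, sigma, nu, drop = 5.0, 0.3, 3.0, 0.2
f_i = 0.8    # latent GP value at one observation
eff_i = 0.4  # efficiency draw for that observation
y_i = 2.1    # the observed value

# Component densities: scaled-down vs. full-efficiency
p_scaled = student_t.pdf(y_i, df=nu, loc=offset + f_i * eff_i, scale=sigma)
p_full = student_t.pdf(y_i, df=nu, loc=offset + f_i, scale=sigma)

# Bayes' rule: posterior probability that this point was scaled down
resp = drop * p_scaled / (drop * p_scaled + (1 - drop) * p_full)
```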