Help with a Gaussian Process for very few data points

I want to fit a Gaussian Process to the relationship between relative humidity (RH) and height using only 6 points (stations). My goal is then to find the probable altitude of the 50% RH level.

So, if on a given day at some hour I have (X is RH, y is altitude):

import numpy as np
import pymc3 as pm

X = np.array([32, 38, 94, 83, 99, 78])
y = np.array([1702, 1514, 683, 269, 900, 86])

I do:

with pm.Model() as d_model:
    # Specify the covariance function
    vert = pm.Exponential("vert", 0.5)
    l = pm.Gamma(name='l', alpha=7, beta=1)
    cov_func = vert**2 * pm.gp.cov.Matern32(1, ls=l) 

    # Specify mean
    mean_init = pm.gp.mean.Constant(c=500)

    # Specify the GP
    gp = pm.gp.Marginal(mean_func=mean_init, cov_func=cov_func)

    # Place a GP prior over the function and do the noise:
    sigma = pm.HalfCauchy("sigma", beta=1)
    y_ = gp.marginal_likelihood("y_", X=X[:, None], y=y, noise=sigma)

    # MCMC:
    trace = pm.sample(10000, chains=3, tune=1000, target_accept=0.85)

Now, if I want to make a prediction for RH=50, among others:

with d_model:
    X_new = np.array([12, 50, 78])[:, None]
    f_p = gp.conditional("f_p", X_new)
    pred_samples = pm.sample_posterior_predictive(trace, var_names=["f_p"], samples=10)

In the plot (code omitted for brevity) I get a very poor fit. I have tried changing some parameters and other details, but the fit does not improve.


Is it just a problem of having too few data points, so a Gaussian Process should not be used here, or is there some part of the definition of the Gaussian Process I don’t get? Thx.

(Using PyMC3)

I think your issue here is the priors and the scaling of the data. Your vert parameter controls the “amplitude” of the GP; for your data it should probably be on the order of 100s to 1000s. sigma controls the noise, so it is probably in the 100s as well. I’d recommend rescaling your y data, since this also helps with numerical stability, and then looking a bit more carefully at the priors for vert and sigma too.

Especially with such a small dataset, using strongly informative priors becomes much more important! It will help with divergences too.

Here’s a modified version of your code that gives a more sensible output that might help you get started:

import numpy as np
import matplotlib.pyplot as plt
import pymc as pm  # PyMC v4

X = np.array([32, 38, 94, 83, 99, 78])
y = np.array([1702, 1514, 683, 269, 900, 86])
y_sc = (y - np.mean(y)) / np.std(y)

with pm.Model() as d_model:
    # Specify the covariance function
    vert = pm.HalfNormal("vert", sigma=1.0)
    l = pm.Gamma(name='l', alpha=7, beta=1)
    cov_func = vert**2 * pm.gp.cov.Matern32(1, ls=l) 

    # Mean: after standardizing y its mean is ~0, so the default zero mean
    # function is used (the separate constant mean is no longer needed)

    # Specify the GP
    gp = pm.gp.Marginal(cov_func=cov_func)

    # Place a GP prior over the function and do the noise:
    sigma = pm.HalfNormal("sigma", sigma=1)
    y_ = gp.marginal_likelihood("y_", X=X[:, None], y=y_sc, noise=sigma)

    # MCMC:
    trace = pm.sample(500, chains=2, tune=1000, target_accept=0.95)
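
    # Predict at new RH values within the same model context.
    # (The question used X_new = np.array([12, 50, 78])[:, None]; a denser grid
    # is just my choice here so the plot below looks smoother.)
    X_new = np.linspace(0, 100, 101)[:, None]
    f_p = gp.conditional("f_p", X_new)
    pred_samples = pm.sample_posterior_predictive(trace, var_names=["f_p"])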
    
    
fig = plt.figure(figsize=(15, 5))
ax = fig.gca()

f_p = pred_samples.posterior_predictive['f_p'].stack(samples=['chain', 'draw'])
pm.gp.util.plot_gp_dist(ax, f_p.values.T, X_new.flatten());
plt.plot(X, y_sc, 'o', color='dodgerblue', alpha=0.5, markersize=20);

I used PyMC v4 though, so it’ll need a couple of small changes to work with PyMC3.
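
One more note, since your original goal was the altitude of the 50% RH level: the GP above was fit on the standardized y_sc, so you need to undo that scaling to get back to metres. A minimal sketch, assuming pred_samples and X_new from the code above:

f_samples = pred_samples.posterior_predictive["f_p"].stack(samples=["chain", "draw"]).values
# Undo y_sc = (y - mean) / std to get altitudes in the original units
alt_samples = f_samples * np.std(y) + np.mean(y)

# Posterior for the altitude at the grid point closest to RH = 50
idx_50 = np.argmin(np.abs(X_new.flatten() - 50))
print(np.mean(alt_samples[idx_50]), "+/-", np.std(alt_samples[idx_50]))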


Thank you very much. Let me ask you 2 questions:

  1. I see you standardize the “Y” variable but not “X”. Would standardizing X improve the results? (In theory, when you are taught standardization, you standardize everything.)
  2. If you take 500 samples with 2 chains but tune for 1000, you are using all of the sample for tuning (2x500), am I right? Shouldn’t you give a bigger number than 500 for sampling? Thx.

Of course!

  1. It might, and it definitely wouldn’t hurt, but I’ve noticed things are a bit more sensitive to the Y scaling. If the scale of X is very far from [-3, 3] it could help. I do try to avoid standardizing X because it makes the lengthscale parameters a bit harder to set priors for and to interpret for GPs (though you could easily rescale the lengthscales afterwards; see the small sketch after this list). On the other hand, standardizing Y makes it easier (for me at least) to set priors for things like the overall amplitude (vert) and the different noise sources, since you know something like pm.HalfNormal('vert', sigma=1) will probably give reasonable results as a starting point.

  2. So the default is 1000 tuning steps, and the 500 is the number of draws kept after tuning. So here it ran 3000 iterations in total (2 chains x 1500), dropped the first 1000 of each chain, and returned the last 500 of each chain. It’s definitely always better to use a bigger number of draws for the final run, but for debugging / prototyping it’s better to iterate faster, so that’s the only reason I cut the number of samples.
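
On the lengthscale point in (1), here’s roughly what I mean as a tiny sketch (the numbers are just made up): a lengthscale learned on standardized X converts back to the original RH units by multiplying by the standard deviation of X.

# Hypothetical example: if X had been standardized before fitting,
# the lengthscale on the standardized scale maps back to RH units like this.
X_sc = (X - np.mean(X)) / np.std(X)
ls_standardized = 0.5                      # e.g. a posterior mean of l on the X_sc scale
ls_original = ls_standardized * np.std(X)  # the same lengthscale in RH percentage points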

Hope that helps
