Representing data from Gaussian process

I have implemented a gaussian process using my actual data points to generate predicted data points. This was done using the code below and is all functioning well.

data = pandas.read_csv('exported_stress_fsw_311.csv')
data = data[data['load']==0]
data = data[data['y']==0]
data = data[::2]

x = data['x'].to_numpy()

x = x[:,None]

y = data['s11'].to_numpy()

er = data['s11_std'].to_numpy()
X_new = numpy.linspace(-50, 50, 100)[:, None]

with pm.Model() as model:

    l = pm.Gamma("l", alpha=2, beta=1)

    n = pm.HalfCauchy("n",beta=5)

    cov = n ** 2 *,l)
    #Specify the GP, the default mean function is zero in this case as not specified
    gp =
    #Placing priors over the function
    s = pm.HalfCauchy("s",beta=5)
    #Marginal likelihood method fits the model
    y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s+er)
    #Find the parameters that best fit the data
    #mp = pm.find_MAP()
    trace = pm.sample()
    #.conditional distribition for predictions given the X_new values 
    f_pred = gp.conditional("f_pred",X_new)
    #Predict the distribution of samples on the new x values
    pred_samples = pm.sample_posterior_predictive(trace, var_names= ['f_pred'],samples=2000)

However my research is based on looking at this data and removing actual data points to see how the results for the predictive points start to differ. I am looking for an effective method to represent this. At the moment I have been doing (mean predicted data-mean actual data)^2 and then looking into also dividing that by the variance. I have been using the code below to do this

meanpreds = np.mean(pred_samples['f_pred'], axis=1)
diff = meanpreds - meandata
sqdiff = np.square(diff)
sumdiff = np.sum(sqdiff)

And then plotting the sum of the differences against the number of data points
My questions for this are

  1. If there is any other known ways which may be a better way to represent the gaussian data, I used this as some of my values will be negative so needed to square them.
  2. I would like to include some kind of noise/uncertainty data in this plot also and not entirely sure how to do this so any suggestions would be appreciated.
  3. Lastly I have run all of this but currently do every run separately when changing how many data points there are, Im currently working to try and optimise this with a for loop which will run through the different values for this line of code
    data = data[::2]
    but struggling to find a way to store the data for every for loop so any advice ont hat would be helpful

I’ve run into similar issues in my own research. Here are some of the visualizations I’ve done

  • What you’re describing sounds like a learning curve built up through cross-validation. Here’s how I’ve shown that before to compare how quickly different model variations improve with the amount of data they’re given. These are metrics on the test set of my data, with the size of the training set on the x-axis. I showed similar curves for the training set. I did ~100 random subets for each test/train split to get the confidence intervals. Some people overlap confidence intervals on plots like this, but I find those basically impossible to read. Showing the NLPD (negative log probability density) is good because it is a nice indicator for whether the uncertainty in your predictions is properly calibrated. A high RMSE implies your model is inaccurate whereas a high NLPD implies your model is overconfident (because the real observations are occuring in regions of low probability density, meaning the predictive uncertainty is concentrated too far away from the predictions).

  • Another way to examine calibration is to plot observed error (x-axis) vs predictive uncertainty (y-axis) at each point as a scatter plot, along with a straight line with slope=2. Roughly 95% of your points should fall below that line, otherwise your model is over/under confident.
  • You could show actual vs predicted w/ error bars, hoping that the data cluster around a line with slope 1
  • My suggestion for storing the data at each iteration of your cross-validation loop is to instead store a reproducible way to re-generate the data. For example, if you use np.random.choice to split your data into test/train, then all you need to store is the random seed and the subset size and you’ll always be able to reconstruct your test/train split.

I’ve actually written a package to make a lot of this easier, it’s called Gumbi (Github, docs). It’s designed for quickly prototyping GPs for tabular (pandas) data. It even includes a tool to help with cross-validation. You still have to run a loop, but this makes each iteration easier to write (be sure to see the note about reproducible randomness).


One side note by way of unsolicited advice, I’m suspicious that this line is going to give you issues:

y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s+er)

I did something similar here, so make sure that behavior is what you’re going for. I think the s variable there isn’t going to do anything, I’m not certain, but I suspect the posterior will look just like the prior, or maybe it’ll essentially get as close as it can to 0. And I think that including observed noise will mean that you can’t make predictions with_noise, if that’s a goal at some point.

I’m not sure if this is possible with your experimental set up, but generally I find it’s better to provide all the individual (noisy) observations to a GP, rather than aggregate them via mean/std and trying to fit a GP to the summary statistics. But of course, YMMV.

1 Like

Thank you for your help! Would it be possible to see the code you used to produce the RMSE and NLPD plots as I think this would be very useful for my project. Sorry very new to python! And yes I can see what you mean by including the error of the data, for mine it produces a very similar Gaussian distribution plot it just takes the data training points as less certain (as I was having issues with the sampling taking each point as have a zero error) although since introducing trace instead of MAP it seems to be okay so I could most likely run it without that

Also your Gumbi package looks very interesting, so would this perform a similar process to what I have written above but automate it more? And what are the results of the package, how do they cross validate?

Sure, here you go!

Here, CV is a “tidy” dataframe where each row is one metric on one particular test/train subset, with columns for Sample (‘Test’ or ‘Train’, i.e. evaluated on the test set or the training set), Metric (‘RMSE’/‘NLPD’), Score (the actual value), Model (the name of several different model variations I was comparing, SubsetSize (size of the training set), and Seed (the random seed used to define the specific test/train split). Note that this is quite large: each model has ~80 different SubsetSizes, ~100 different Seeds, and two metrics on each of two sets each. I did the plotting below on my laptop, but I ran all those tests on a computing cluster. Cross-validation is naturally parallelizable, so if you have access to a computing cluster that will make your life much easier!

colors = sns.color_palette("Set2")

shade = 1

color = {
    'Average':[rgb*shade for rgb in colors[0]],
    'Coregion':[rgb*shade for rgb in colors[1]],
    'Spatial':[rgb*shade for rgb in colors[2]],
    'GLM':[rgb*shade for rgb in colors[3]],
    'Individual':[rgb*shade for rgb in colors[-2]],

#palette = sns.light_palette(color['Spatial'])

def lq(x): return np.percentile(x,2.5)
def uq(x): return np.percentile(x,97.5)

for sample in ['Train','Test']:
    for metric in ['RMSE','NLPD']:

        grouped = cv[(cv.Sample==sample)&(cv.Metric==metric)&(cv.Score<10)].groupby(['Model','SubsetSize'])['Score'].agg([np.median, lq, uq]).reset_index()

        fig = plt.figure(tight_layout=True, figsize=(8,6))
        axs = {
        'Major' : fig.add_subplot(1,2,1),
        'Random' : fig.add_subplot(2,4,3),
        'Average' : fig.add_subplot(2,4,4),
        'Coregion' : fig.add_subplot(2,4,7),
        'Spatial' : fig.add_subplot(2,4,8),}

        yl = {'NLPD':[-0.5,7],'RMSE':[0,2.]}[metric]

        for model in major_models:
            this = grouped[grouped.Model==model]
            axs['Major'].plot(this.SubsetSize.values,this['median'].values, color=color[model], lw=3)
            for model_,ax_ in axs.items():
                if model_ not in [model,'Major']:
                    ax_.plot(this.SubsetSize.values,this['median'].values, color='0.8', zorder=-1)

            ax = axs[model]
            ax.plot(this.SubsetSize.values,this['median'].values, color=color[model], zorder=1)
            ax.fill_between(this.SubsetSize.values,this['lq'].values,this['uq'].values, color=color[model], alpha=0.5, zorder=0)

            axs['Major'].get_shared_x_axes().join(axs['Major'], ax)
            axs['Major'].get_shared_y_axes().join(axs['Major'], ax)
            ax.set_title(model, color=color[model])
        axs['Major'].set_xlabel('Subset Size')
        axs['Major'].set_title(f'{sample}ing Set')
        savemyfig(f'Cross-validation major model comparison - {sample} {metric}')

As for Gumbi, exactly. It’s just a way to build simple(ish) models with less code. An equivalent in your case would be:

import gumbi as gmb
import pandas as pd

data = pd.read_csv('exported_stress_fsw_311.csv')
data = ...

ds = gmb.DataSet(data, outputs=['s11'], log_vars=['s11'])  # I'm assuming stress is strictly non-negative, otherwise just omit the log_vars

gp = gmb.GP(ds)

See here for details on the model, but in short it’s pretty similar to what you’ve done, except it can’t incorporate observed uncertainty (yet).

If you want to just use the MAP, you can fit and make predictions this way:


# If you want automatically-defined sensible bounds based on your data
X = gp.prepare_grid()  
# If you want to explicitly set limits:
limits = gp.parray('x', [-50, 50])
X = gp.prepare_grid(limits=limits)

y = gp.predict_grid()  # y is a structured array containing mean and variance at each point

# Visualize

If you want the full trace and individual samples from the GP

gp.sample(2000, chains=4, cores=1)  # or whatever inputs you want

X = gp.prepare_grid()
var_name = 'Predicted Stress'
gp.stdzr.log_vars += [var_name]  # Only if you included s11 in 'log_vars' up above
pred_samples = gp.draw_grid_samples(source=gp.trace, var_name=var_name)

# Visualize
y= gp.uparray(var_name, μ=pred_samples.z.values().mean(0), σ2=pred_samples.z.values().var(0), stdzd=True)
gmb.ParrayPlotter(X, y).plot()

# Alternatively
from import plot_gp_dist

plot_gp_dist(plt.gca(), pred_samples.values(), X.values())

Finally, for cross-validation, you can now do:

cv_result = gp.cross_validate(n_train=10, seed=2022)
# {
#     'train': {
#         'data': ...,  # A DataSet object containing the training points
#         'NLPDs': ...,  # A numpy array containing the NLPDs for each training point
#         'errors': ...  # A numpy array containing observed-predicted for each training point
#              },
#     'test': {
#         'data': ...,
#         'NLPDs': ...,
#         'errors': ...
#             },
# }

# I would average, rather than add, your squared differences
test_rmse = np.sqrt(np.mean(np.square(cv_result['test']['errors'])))

Loop through that with k different seeds for k-fold cross-validation for each of your different n_train sizes. For now, cross_validate only works with find_MAP, though I’m meaning to change that to allow sample as well.

Hi, just to revisit something you said a few weeks ago about this line of code
y_ = gp.marginal_likelihood(“y”,X=x,y=y, noise=s+er)

Could you expand a bit more on why this is technically incorrect. I do agree and it really doesn’t make much difference to my model, but I need to justify not including it to my supervisor. For context the error I am including is standard deviation from the measured stress values and was found in the experimental/modelling stages and provided to me in a dataset

So to check what’s going on there, compare the prior and posterior for s. I’m suspicious that it will either be unchanged or become very very small (probably the latter). I could be wrong, though, if it’s a reasonable value that’s great! But you won’t be able to make predictions with_noise, because if you have 9 observed points and you’re trying to make predictions over 100 new points, the model doesn’t know how to distribute your 9 StDevs. If you don’t care about predicting noise, just about epistemic uncertainty, then this shouldn’t matter. There’s a few different scenarios that you could try, depending on what you’re trying to actually do with the GP:

  1. Use every individual measured value (if you have access to them). As an experimentalist-turned-modeller, one of the oddly unintuitive barriers to overcome for and others around me is that you should avoid fitting on aggregated data at all costs, particularly if your data is at all non-linear. It can hide all sorts of nuances in the data, and can lead your model to perform really poorly. That might be out of your power, but I’d see if you can persuade someone to track down the original measurements.
  2. You said above “I was having issues with the sampling taking each point as have a zero error”. Presumably in that scenario you were still learning s, as in y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s)? If you had noise=0 I can see why that would be problematic, but I would have thought noise=s should have been fine. You could also try an ExpQuad kernel, it’s a bit less wiggly than Matern52 so (I think) more likely to attribute small variations to noise rather than reality.
  3. You could try incorporating the observed uncertainty into y directly, by modeling y as a Normal RV with the appropriate sigma, similar to how uncertainty is incorporated into the x values in the Mauna Loa example. I haven’t tried this, so it might not even be possible, just a thought.
  4. If your standard deviations are relatively similar, or at least not systematically different, you could take the mean stdev and use it as a floor for your learned noise. I.e., y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s) but where s = pm.Deterministic('s', np.mean(stds) + pm.HalfCauchy("s_",beta=5).
  5. If there is a systematic trend to your noise, you might consider implementing a Heteroskedastic GP. Essentially, that example learns two independent latent GPs, one for each of mean behavior and noise, then combines them to model the observations. For this it would still be ideal if you have access to the individual measurements. If not, you’ll probably want to have an additional likelihood that relates the noise-GP to the observed stdevs. Again, I haven’t actually tried that, so buyer beware…

Hopefully that helps!