# Representing data from Gaussian process

I have implemented a gaussian process using my actual data points to generate predicted data points. This was done using the code below and is all functioning well.

``````data = pandas.read_csv('exported_stress_fsw_311.csv')
data = data[data['y']==0]
data = data[::2]

x = data['x'].to_numpy()

x = x[:,None]

y = data['s11'].to_numpy()

er = data['s11_std'].to_numpy()
X_new = numpy.linspace(-50, 50, 100)[:, None]

with pm.Model() as model:

l = pm.Gamma("l", alpha=2, beta=1)

n = pm.HalfCauchy("n",beta=5)

cov = n ** 2 * pm.gp.cov.Matern52(1,l)
#Specify the GP, the default mean function is zero in this case as not specified
gp = pm.gp.Marginal(cov_func=cov)
#Placing priors over the function
s = pm.HalfCauchy("s",beta=5)
#Marginal likelihood method fits the model
y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s+er)
#Find the parameters that best fit the data
#mp = pm.find_MAP()
trace = pm.sample()
#.conditional distribition for predictions given the X_new values
f_pred = gp.conditional("f_pred",X_new)

#Predict the distribution of samples on the new x values
pred_samples = pm.sample_posterior_predictive(trace, var_names= ['f_pred'],samples=2000)
``````

However my research is based on looking at this data and removing actual data points to see how the results for the predictive points start to differ. I am looking for an effective method to represent this. At the moment I have been doing (mean predicted data-mean actual data)^2 and then looking into also dividing that by the variance. I have been using the code below to do this

``````meanpreds = np.mean(pred_samples['f_pred'], axis=1)
diff = meanpreds - meandata
sqdiff = np.square(diff)
sumdiff = np.sum(sqdiff)
sumdiff
``````

And then plotting the sum of the differences against the number of data points
My questions for this are

1. If there is any other known ways which may be a better way to represent the gaussian data, I used this as some of my values will be negative so needed to square them.
2. I would like to include some kind of noise/uncertainty data in this plot also and not entirely sure how to do this so any suggestions would be appreciated.
3. Lastly I have run all of this but currently do every run separately when changing how many data points there are, Im currently working to try and optimise this with a for loop which will run through the different values for this line of code
`data = data[::2]`
but struggling to find a way to store the data for every for loop so any advice ont hat would be helpful

Iâve run into similar issues in my own research. Here are some of the visualizations Iâve done

• What youâre describing sounds like a learning curve built up through cross-validation. Hereâs how Iâve shown that before to compare how quickly different model variations improve with the amount of data theyâre given. These are metrics on the test set of my data, with the size of the training set on the x-axis. I showed similar curves for the training set. I did ~100 random subets for each test/train split to get the confidence intervals. Some people overlap confidence intervals on plots like this, but I find those basically impossible to read. Showing the NLPD (negative log probability density) is good because it is a nice indicator for whether the uncertainty in your predictions is properly calibrated. A high RMSE implies your model is inaccurate whereas a high NLPD implies your model is overconfident (because the real observations are occuring in regions of low probability density, meaning the predictive uncertainty is concentrated too far away from the predictions).

• Another way to examine calibration is to plot observed error (x-axis) vs predictive uncertainty (y-axis) at each point as a scatter plot, along with a straight line with slope=2. Roughly 95% of your points should fall below that line, otherwise your model is over/under confident.
• You could show actual vs predicted w/ error bars, hoping that the data cluster around a line with slope 1
• My suggestion for storing the data at each iteration of your cross-validation loop is to instead store a reproducible way to re-generate the data. For example, if you use `np.random.choice` to split your data into test/train, then all you need to store is the random seed and the subset size and youâll always be able to reconstruct your test/train split.

Iâve actually written a package to make a lot of this easier, itâs called Gumbi (Github, docs). Itâs designed for quickly prototyping GPs for tabular (`pandas`) data. It even includes a tool to help with cross-validation. You still have to run a loop, but this makes each iteration easier to write (be sure to see the note about reproducible randomness).

4 Likes

One side note by way of unsolicited advice, Iâm suspicious that this line is going to give you issues:

``````y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s+er)
``````

I did something similar here, so make sure that behavior is what youâre going for. I think the `s` variable there isnât going to do anything, Iâm not certain, but I suspect the posterior will look just like the prior, or maybe itâll essentially get as close as it can to 0. And I think that including observed noise will mean that you canât make predictions `with_noise`, if thatâs a goal at some point.

Iâm not sure if this is possible with your experimental set up, but generally I find itâs better to provide all the individual (noisy) observations to a GP, rather than aggregate them via mean/std and trying to fit a GP to the summary statistics. But of course, YMMV.

1 Like

Thank you for your help! Would it be possible to see the code you used to produce the RMSE and NLPD plots as I think this would be very useful for my project. Sorry very new to python! And yes I can see what you mean by including the error of the data, for mine it produces a very similar Gaussian distribution plot it just takes the data training points as less certain (as I was having issues with the sampling taking each point as have a zero error) although since introducing trace instead of MAP it seems to be okay so I could most likely run it without that

Also your Gumbi package looks very interesting, so would this perform a similar process to what I have written above but automate it more? And what are the results of the package, how do they cross validate?

Sure, here you go!

Here, CV is a âtidyâ dataframe where each row is one metric on one particular test/train subset, with columns for Sample (âTestâ or âTrainâ, i.e. evaluated on the test set or the training set), Metric (âRMSEâ/âNLPDâ), Score (the actual value), Model (the name of several different model variations I was comparing, SubsetSize (size of the training set), and Seed (the random seed used to define the specific test/train split). Note that this is quite large: each model has ~80 different SubsetSizes, ~100 different Seeds, and two metrics on each of two sets each. I did the plotting below on my laptop, but I ran all those tests on a computing cluster. Cross-validation is naturally parallelizable, so if you have access to a computing cluster that will make your life much easier!

``````colors = sns.color_palette("Set2")

color = {
'Random':'0.5',
}

#palette = sns.light_palette(color['Spatial'])

def lq(x): return np.percentile(x,2.5)
def uq(x): return np.percentile(x,97.5)

for sample in ['Train','Test']:
for metric in ['RMSE','NLPD']:

grouped = cv[(cv.Sample==sample)&(cv.Metric==metric)&(cv.Score<10)].groupby(['Model','SubsetSize'])['Score'].agg([np.median, lq, uq]).reset_index()

fig = plt.figure(tight_layout=True, figsize=(8,6))
axs = {

yl = {'NLPD':[-0.5,7],'RMSE':[0,2.]}[metric]

for model in major_models:
this = grouped[grouped.Model==model]
axs['Major'].plot(this.SubsetSize.values,this['median'].values, color=color[model], lw=3)
for model_,ax_ in axs.items():
if model_ not in [model,'Major']:
ax_.plot(this.SubsetSize.values,this['median'].values, color='0.8', zorder=-1)

ax = axs[model]
ax.plot(this.SubsetSize.values,this['median'].values, color=color[model], zorder=1)
ax.fill_between(this.SubsetSize.values,this['lq'].values,this['uq'].values, color=color[model], alpha=0.5, zorder=0)

axs['Major'].get_shared_x_axes().join(axs['Major'], ax)
axs['Major'].get_shared_y_axes().join(axs['Major'], ax)
ax.set_yticklabels('')
ax.set_xticklabels('')
ax.set_xticks(np.arange(0,81,20))
ax.set_title(model, color=color[model])
ax.set_ylim(yl)
axs['Major'].set_xlabel('Subset Size')
axs['Major'].set_ylabel(metric)
axs['Major'].set_title(f'{sample}ing Set')
axs['Major'].set_xticks(np.arange(0,81,20))

savemyfig(f'Cross-validation major model comparison - {sample} {metric}')
``````

As for Gumbi, exactly. Itâs just a way to build simple(ish) models with less code. An equivalent in your case would be:

``````import gumbi as gmb
import pandas as pd

data = ...

ds = gmb.DataSet(data, outputs=['s11'], log_vars=['s11'])  # I'm assuming stress is strictly non-negative, otherwise just omit the log_vars

gp = gmb.GP(ds)
gp.specify_model(continuous_dims=['x'])
gp.build_model(continuous_kernel='Matern52')
``````

See here for details on the model, but in short itâs pretty similar to what youâve done, except it canât incorporate observed uncertainty (yet).

If you want to just use the MAP, you can fit and make predictions this way:

``````gp.find_MAP()

# If you want automatically-defined sensible bounds based on your data
X = gp.prepare_grid()
# If you want to explicitly set limits:
limits = gp.parray('x', [-50, 50])
X = gp.prepare_grid(limits=limits)

y = gp.predict_grid()  # y is a structured array containing mean and variance at each point

# Visualize
gmb.ParrayPlotter(X,y).plot()
``````

If you want the full trace and individual samples from the GP

``````gp.sample(2000, chains=4, cores=1)  # or whatever inputs you want

X = gp.prepare_grid()
var_name = 'Predicted Stress'
gp.stdzr.log_vars += [var_name]  # Only if you included s11 in 'log_vars' up above
pred_samples = gp.draw_grid_samples(source=gp.trace, var_name=var_name)

# Visualize
y= gp.uparray(var_name, Îź=pred_samples.z.values().mean(0), Ď2=pred_samples.z.values().var(0), stdzd=True)
gmb.ParrayPlotter(X, y).plot()

# Alternatively
from pymc3.gp.util import plot_gp_dist

plot_gp_dist(plt.gca(), pred_samples.values(), X.values())
``````

Finally, for cross-validation, you can now do:

``````cv_result = gp.cross_validate(n_train=10, seed=2022)
# {
#     'train': {
#         'data': ...,  # A DataSet object containing the training points
#         'NLPDs': ...,  # A numpy array containing the NLPDs for each training point
#         'errors': ...  # A numpy array containing observed-predicted for each training point
#              },
#     'test': {
#         'data': ...,
#         'NLPDs': ...,
#         'errors': ...
#             },
# }

test_rmse = np.sqrt(np.mean(np.square(cv_result['test']['errors'])))
``````

Loop through that with k different seeds for k-fold cross-validation for each of your different n_train sizes. For now, `cross_validate` only works with `find_MAP`, though Iâm meaning to change that to allow `sample` as well.

So to check whatâs going on there, compare the prior and posterior for `s`. Iâm suspicious that it will either be unchanged or become very very small (probably the latter). I could be wrong, though, if itâs a reasonable value thatâs great! But you wonât be able to make predictions `with_noise`, because if you have 9 observed points and youâre trying to make predictions over 100 new points, the model doesnât know how to distribute your 9 StDevs. If you donât care about predicting noise, just about epistemic uncertainty, then this shouldnât matter. Thereâs a few different scenarios that you could try, depending on what youâre trying to actually do with the GP:
2. You said above âI was having issues with the sampling taking each point as have a zero errorâ. Presumably in that scenario you were still learning `s`, as in `y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s)`? If you had `noise=0` I can see why that would be problematic, but I would have thought `noise=s` should have been fine. You could also try an ExpQuad kernel, itâs a bit less wiggly than Matern52 so (I think) more likely to attribute small variations to noise rather than reality.
3. You could try incorporating the observed uncertainty into `y` directly, by modeling `y` as a Normal RV with the appropriate sigma, similar to how uncertainty is incorporated into the `x` values in the Mauna Loa example. I havenât tried this, so it might not even be possible, just a thought.
4. If your standard deviations are relatively similar, or at least not systematically different, you could take the mean stdev and use it as a floor for your learned noise. I.e., `y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s)` but where `s = pm.Deterministic('s', np.mean(stds) + pm.HalfCauchy("s_",beta=5)`.