Analysing Gaussian data

The basis of my project is running a Gaussian process on a full set of training data, then systematically removing training data and evaluating how the accuracy of the predicted points changes.

Currently I evaluate this by predicting at every position where I have training data; as training data is removed, Gaussian processes are fitted to the remaining points. I then plot the RMSE against the number of training points to show how the RMSE decreases as more training points are introduced.

My report's purpose is to advise someone with less time when they have enough data to run an accurate Gaussian process. I am planning on looking at percentage changes in RMSE to give a numerical point at which this occurs.
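
For illustration, a minimal sketch of that percentage-change idea (the rmse and n_points arrays here are hypothetical placeholders for the values my loop below produces, and the 5% threshold is arbitrary):

import numpy as np

# Hypothetical example values: RMSE for increasing numbers of training points
n_points = np.array([7, 9, 11, 17, 33])
rmse = np.array([42.0, 30.5, 24.1, 23.2, 22.9])

# Percentage improvement in RMSE from each training-set size to the next
pct_change = 100 * (rmse[:-1] - rmse[1:]) / rmse[:-1]

# First size at which adding more data improves RMSE by less than 5%
threshold = 5.0
flat = n_points[1:][pct_change < threshold]
print(pct_change, flat[0] if len(flat) else 'no plateau yet')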

Questions

  1. This method does not seem to have much confidence behind it, and I am looking for other possible suggestions on what I could do.
  2. I would also like to run predictions at points where I do not have training data (I have the code to run this) and quantify their accuracy. To my understanding I can't use the RMSE here, as I don't always have actual data to compare the predictions to. I was thinking of summing the variance of the predictive points for each run and plotting this against the number of training points (see the sketch after this list), but I am also open to other suggestions.
  3. Are there any other goodness-of-fit methods which could be appropriate?
  4. On another topic, I have standard deviations for my actual data set, produced from the fitting and experimental process. I am currently trying to prove these are random, and therefore just scalars, using a runs test; if I cannot, would it be appropriate to add them to this line in the noise section, or is that statistically wrong?
    y_ = gp.marginal_likelihood("y", X=x, y=y, noise=s + error)
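
For question 2, here is a minimal sketch of the summed-variance idea, assuming pred_samples['f_pred'] holds the posterior predictive draws at the new locations with shape (n_draws, n_new_points), as in my code further down:

import numpy as np

# Posterior predictive draws at the new x locations, shape (n_draws, n_new_points)
f_draws = pred_samples['f_pred']

# Per-point predictive variance, collapsed to one number for this training-set size
point_var = np.var(f_draws, axis=0)
total_var = point_var.sum()   # could equally use point_var.mean()

# total_var would then be appended alongside the number of training points and
# plotted against it, just as the RMSE is plotted against training-set size below.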

My code is below; it runs a Gaussian process in a loop for different numbers of training points and stores the number of training points together with the RMSE.

Thank you! Sorry for all the questions; I'm quite new to Python.

import pandas
import numpy as np
import seaborn
import matplotlib.pyplot as plt
import pymc3 as pm
import arviz as az

seaborn.set_style('white')

#Setting up X_new; this doesn't need to change, the X_new values are always the same
actual_data = pandas.read_csv('exported_stress_fsw_311.csv')
actual_data = actual_data[actual_data['load'] == 0]
actual_data = actual_data[actual_data['y'] == 0]
X_new = actual_data['x'].to_numpy()
X_new = X_new[:, None]

#Stress component to model
ss = 's12'

#Measured values at load 25 for each y position
datam = pandas.read_csv('exported_stress_fsw_311.csv')
datam = datam[datam['load'] == 25]
y0 = datam[datam['y'] == 0][ss].to_numpy()
y10 = datam[datam['y'] == 10][ss].to_numpy()
y20 = datam[datam['y'] == 20][ss].to_numpy()
y60 = datam[datam['y'] == 60][ss].to_numpy()

#Average over the four y positions, used as the "actual" values for the RMSE
mean_actual = (y0 + y10 + y20 + y60) / 4

n1 = []  # number of training points per run
n2 = []  # RMSE per run
data = pandas.read_csv('exported_stress_fsw_311.csv')
data = data[data['load'] == 25]
for z in range(1, 6):
    #Take every z-th point of the full data set (slice the full frame each time
    #rather than re-slicing the already-thinned frame)
    subset = data[::z]
    x = subset['x'].to_numpy()
    y = subset[ss].to_numpy()
    x = x[:, None]
    length = len(subset)

    with pm.Model() as model:
        #Lengthscale
        l = pm.Gamma("l", alpha=2, beta=1)
        #Noise
        n = pm.HalfCauchy("n", beta=5)
        #Covariance
        cov = n ** 2 * pm.gp.cov.Matern52(1, l)
        #Specify the GP; the default mean function is zero as none is specified
        gp = pm.gp.Marginal(cov_func=cov)
        #Placing priors over the function
        s = pm.HalfCauchy("s", beta=5)
        #Marginal likelihood method fits the model
        y_ = gp.marginal_likelihood("y", X=x, y=y, noise=s)
        #Find the parameters that best fit the data
        #mp = pm.find_MAP()
        trace = pm.sample()
        #Conditional distribution for predictions given the X_new values
        f_pred = gp.conditional("f_pred", X_new)
        #Predict the distribution of samples at the new x values
        pred_samples = pm.sample_posterior_predictive(trace, var_names=['f_pred'], samples=2000)

    #RMSE between the mean prediction and the averaged actual values
    meanpred = np.mean(pred_samples['f_pred'], axis=0)
    RMSE = np.sqrt(np.mean(np.square(meanpred - mean_actual)))
    n2.append(RMSE)
    n1.append(length)

plt.plot(n1, n2, '-o')
plt.xlabel('Number of actual data points')
plt.ylabel('RMSE')
plt.show()

  1. This method does not seem to have much confidence behind it, and I am looking for other possible suggestions on what I could do.

Here it would help to be clear about what you mean by confidence. My initial guess is that you mean that your predictive distributions for held-out training points are (1) centered away from the true value, and (2) have a relatively large uncertainty interval. Is that the case?

To my understanding I can't use the RMSE here, as I don't always have actual data to compare the predictions to.

But you could train the GP on a subset of the data and evaluate it on the held-out remainder. Do you see any clear barriers to doing that, aside from extra time and computation?
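
For instance, a minimal sketch of that split, assuming x and y are the training arrays from your code (the 20% held-out fraction is arbitrary):

import numpy as np

rng = np.random.default_rng(0)

# Hold out roughly 20% of the points for evaluation
n_obs = len(x)
held_out = np.zeros(n_obs, dtype=bool)
held_out[rng.choice(n_obs, size=n_obs // 5, replace=False)] = True

x_train, y_train = x[~held_out], y[~held_out]
x_test, y_test = x[held_out], y[held_out]

# Fit the GP on (x_train, y_train), predict at x_test via gp.conditional and
# pm.sample_posterior_predictive, take the predictive mean y_pred, then
# rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))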

I am planning on looking at percentage changes in RMSE to give a numerical point at which this occurs.

If RMSE is the only metric of interest, you may get just as much mileage out of something similar and non-statistical such as a radial basis function smoother. I think the GP’s most appealing feature is its uncertainty quantification, which doesn’t necessarily get used in a root-mean-square-error calculation.
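
As a rough illustration (scipy's Rbf is used here purely as an example smoother; x, y, X_new and mean_actual are from your code above):

import numpy as np
from scipy.interpolate import Rbf

# Fit a radial basis function smoother to the training data
rbf = Rbf(x.ravel(), y, function='multiquadric', smooth=1.0)

# Evaluate at the prediction locations and compute the same RMSE as for the GP
y_rbf = rbf(X_new.ravel())
rmse_rbf = np.sqrt(np.mean((y_rbf - mean_actual) ** 2))

# This gives a point prediction and an error, but no uncertainty intervals,
# which is the part the GP adds.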

  3. Are there any other goodness-of-fit methods which could be appropriate?

Depending on your application, you may be interested in how well calibrated the GP is. In short, this is how often the held-out values fall within the predictive credible intervals. If you calculate the 90% credible interval for the held-out value, and repeat this experiment across many held-out subsets, you would hope that the fraction of the time that the true value falls in that 90% interval is exactly 90%. The RMSE and the calibration would supply enough information about goodness-of-fit for many applications.
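
A minimal sketch of that coverage check, assuming pred_samples['f_pred'] holds posterior predictive draws at the held-out locations and y_held (a hypothetical name) is the vector of true held-out values:

import numpy as np

f_draws = pred_samples['f_pred']          # shape (n_draws, n_held_out)

# Pointwise 90% credible interval from the posterior predictive draws
lower = np.percentile(f_draws, 5, axis=0)
upper = np.percentile(f_draws, 95, axis=0)

# Fraction of held-out values falling inside their own 90% interval
coverage = np.mean((y_held >= lower) & (y_held <= upper))
print('empirical coverage:', coverage, '(nominal 0.90)')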

On another topic, I have standard deviations for my actual data set, produced from the fitting and experimental process. I am currently trying to prove these are random, and therefore just scalars, using a runs test; if I cannot, would it be appropriate to add them to this line in the noise section, or is that statistically wrong?

In principle, this is a smart thing to do if you have these error estimates ahead of time. I would modify your model, however, to make this fit in. The code block below isn't placing a prior over the GP function draw; it's actually placing a prior over the additional additive noise that is tacked on in the call to gp.marginal_likelihood.

#Placing priors over the function
s = pm.HalfCauchy("s", beta=5)

Your actual prior over that function is declared through the prior on the variable n, since you are rescaling the covariance function by n**2 in the line cov = n ** 2 * pm.gp.cov.Matern52(1, l). My opinionated recommendation is to get rid of s entirely and just plug in your values of error, so that it reads
y_ = gp.marginal_likelihood("y", X=x, y=y, noise=error)

This is expressing the assumption that the underlying function is smooth, and your data are noisy with the noise scale captured in error.
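
Putting that together, a sketch of the modified model, assuming error is an array of your known per-point standard deviations aligned with y (everything else as in your code):

with pm.Model() as model:
    l = pm.Gamma("l", alpha=2, beta=1)
    #Prior over the scale of the function itself
    n = pm.HalfCauchy("n", beta=5)
    cov = n ** 2 * pm.gp.cov.Matern52(1, l)
    gp = pm.gp.Marginal(cov_func=cov)

    #Known, fixed observation noise: one standard deviation per data point
    y_ = gp.marginal_likelihood("y", X=x, y=y, noise=error)

    trace = pm.sample()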


Sorry, confusing terminology: by confidence I mean that I don't have a clear way of defining, for my project, when somebody could stop taking experimental data because the Gaussian process will infer the data accurately enough. The Gaussian process I run is largely within the 95% confidence windows and matches what I would expect.

And yes, at the moment I run two different Gaussian processes. One makes predictions only at the values where I have actual experimental data, and I compute the RMSE on this. The second makes predictions at 100 uniformly spaced points; this is the situation where I am struggling to decide how to evaluate at which point the model has enough actual data, as I can't use the RMSE.

Although I do like your idea of checking how often the held-out values fall within the predictive credible intervals. Could you expand on this slightly? I'm not sure I fully understand, but I think it would be really useful. Are you suggesting that, as I run the subsets of data, 90% of the time the true data value should fall within the 90% credible interval, and that the point at which it no longer does could be defined as the point when the Gaussian model does not have enough data to make accurate predictions?

I will also try implementing that last piece of advice in my model and see the results, as it makes a lot of sense. Thank you!

Although I do like your idea of checking how often the held-out values fall within the predictive credible intervals. Could you expand on this slightly? I'm not sure I fully understand, but I think it would be really useful.

Sure - this is basically just a check to make sure that the uncertainty you get from your posterior (predictive uncertainty for GP values at new points, or inferential uncertainty for estimated parameters) matches the true underlying variation. If you repeat an experiment with held-out data 1000 times and the held-out value falls within the 90% credible interval in only 500 of them, then your intervals are too narrow; whereas if it falls inside in 995 of them, you still have an issue - the intervals cover too often and the reported uncertainty is too large. For a variety of reasons related to optimal decision making and estimation, it is desirable to have the actual coverage probability, i.e. the fraction of the time your true value falls within the predicted interval, as close as possible to its nominal value.

Now, it may not exactly suit your purpose as a stopping criterion: even when there are very few data points, the GP may still give predictive credible intervals (e.g. the range between the 5th and 95th percentiles of the predictive distribution) whose coverage is close to its nominal, i.e. desired, value. If you have a large amount of data, then the predictive credible intervals may shrink below some width which you could use as a threshold. Of course, this may depend on the predictive distribution over multiple points. In my experience, the implementation of GPs in PyMC, along with the priors recommended in the docs, will give results with coverage probabilities very close to their nominal levels for most non-pathological tasks.
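
If you do go the interval-width route, a minimal sketch, assuming pred_samples['f_pred'] holds the posterior predictive draws at your 100 uniformly spaced points for a given training-set size, and with a purely hypothetical threshold:

import numpy as np

f_draws = pred_samples['f_pred']          # shape (n_draws, n_new_points)

# Width of the pointwise 90% predictive credible interval
widths = np.percentile(f_draws, 95, axis=0) - np.percentile(f_draws, 5, axis=0)

# One possible stopping rule: stop collecting data once the widest (or mean)
# interval is narrower than a width you consider accurate enough
width_threshold = 10.0   # hypothetical, in the units of your stress data
enough_data = widths.max() < width_threshold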
