The basis of my project is running a Gaussian process on a full set of training data, then systematically removing training data and evaluating how the accuracy of the predictions changes.
Currently I evaluate this by making predictions at every position where I have training data; as training data is removed, Gaussian processes are run on the reduced sets. I then plot the RMSE against the number of training points to show how the RMSE decreases as more training points are introduced.
My report's purpose is to advise someone with less time whether they have enough data to run an accurate Gaussian process. I am planning to look at the percentage change between successive RMSE values to give a numerical point at which this occurs.
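A minimal sketch of that percentage-change check (the RMSE values, point counts, and the 5% threshold here are all made-up assumptions, just to illustrate the idea):

```python
import numpy as np

# Hypothetical RMSE values for increasing numbers of training points
n_points = np.array([7, 9, 12, 17, 33])
rmse = np.array([24.1, 15.3, 11.8, 11.2, 11.0])

# Percentage change between successive RMSE values
pct_change = 100 * np.diff(rmse) / rmse[:-1]

# First point where adding data improves RMSE by less than 5%
threshold = -5.0
enough = n_points[1:][pct_change > threshold]
print(pct_change)
print(enough[0] if enough.size else "keep adding data")
```

The threshold is arbitrary; the useful part is that `pct_change` gives a single number per step that can be reported instead of eyeballing the plot.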
Questions
- This method does not seem to carry much statistical confidence, and I am looking for other suggestions on what I could do.
- I would also like to make predictions at points where I have no training data (I have the code to run this) and quantify their accuracy. To my understanding I can't use the RMSE here, since I don't always have actual data to compare the predictions against. I was thinking of summing the variance of the predictive points each time and plotting this against the number of training points, but I am open to other suggestions.
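A minimal sketch of that variance summing, assuming the predictive draws come back with shape `(draws, points)` as in the code below (the draws here are simulated, not from a real GP):

```python
import numpy as np

# Fake posterior predictive draws: 2000 draws at 33 prediction points
rng = np.random.default_rng(0)
draws = rng.normal(loc=0.0, scale=2.0, size=(2000, 33))

# Per-point posterior predictive variance, then summed over points
point_var = np.var(draws, axis=0)
total_var = point_var.sum()
print(total_var)  # one scalar per training-set size; collect and plot vs. n
```

In the real loop, `draws` would be `pred_samples['f_pred']`, and `total_var` would be appended to a list alongside the number of training points.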
- Are there any other goodness-of-fit methods which could be appropriate?
- On another topic, I have standard deviations for my actual data set, produced from the fitting and experimental process. I am currently trying to show that these are random, and therefore just scalars, using a runs test. If I cannot, would it be appropriate to add them to this line in the noise section, or is that statistically wrong?
y_ = gp.marginal_likelihood("y", X=x, y=y, noise=s + error)
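For the runs test itself, a minimal Wald-Wolfowitz sketch on made-up standard deviations (the data and the 1.96 cutoff are assumptions for illustration):

```python
import numpy as np

# Made-up standard deviations from the fitting/experimental process
rng = np.random.default_rng(1)
stds = rng.normal(loc=3.0, scale=0.5, size=40)

# Wald-Wolfowitz runs test against the mean: count runs of above/below
signs = stds > stds.mean()
runs = 1 + int(np.sum(signs[1:] != signs[:-1]))
n_above, n_below = int(signs.sum()), int((~signs).sum())

# Expected number of runs and its variance under random ordering
mu = 2 * n_above * n_below / (n_above + n_below) + 1
var = (2 * n_above * n_below * (2 * n_above * n_below - n_above - n_below)
       / ((n_above + n_below) ** 2 * (n_above + n_below - 1)))
z = (runs - mu) / np.sqrt(var)
print(runs, z)  # |z| > 1.96 would suggest non-random ordering at the 5% level
```

statsmodels also ships a version of this test (`runstest_1samp` in `statsmodels.sandbox.stats.runs`) if you would rather not hand-roll it.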
My code is below; it runs a Gaussian process in a loop for different numbers of training points, then stores the number of training points and the RMSE.
Thank you! Sorry for all the questions, I'm quite new to Python.
import pandas
import seaborn
import matplotlib.pyplot as plt
import pymc3 as pm
import arviz as az
import numpy as np
seaborn.set_style('white')
#Setting up X_new; this doesn't need to change, the X_new values are always the same
actual_data = pandas.read_csv('exported_stress_fsw_311.csv')
actual_data = actual_data[actual_data['load']==0]
actual_data = actual_data[actual_data['y']==0]
X_new = actual_data['x'].to_numpy()
X_new = X_new[:,None]
ss = 's12'
datam = pandas.read_csv('exported_stress_fsw_311.csv')
datam = datam[datam['load']==25]
y0 = datam[datam['y']==0]
y0 = y0[ss].to_numpy()
y10 = datam[datam['y']==10]
y10 = y10[ss].to_numpy()
y20 = datam[datam['y']==20]
y20 = y20[ss].to_numpy()
y60 = datam[datam['y']==60]
y60 = y60[ss].to_numpy()
mean_actual = (y0+y10+y20+y60)/4
n1 = []
n2 = []
data = pandas.read_csv('exported_stress_fsw_311.csv')
data = data[data['load']==25]
for z in range(1, 6):
    # Subsample from the full data each pass (not from the previous subsample,
    # which the original `data = data[::z]` did by accident)
    sub = data[::z]
    x = sub['x'].to_numpy()[:, None]
    y = sub[ss].to_numpy()
    length = len(sub)
    with pm.Model() as model:
        # Lengthscale
        l = pm.Gamma("l", alpha=2, beta=1)
        # Signal amplitude (scales the covariance, not the observation noise)
        n = pm.HalfCauchy("n", beta=5)
        # Covariance
        cov = n ** 2 * pm.gp.cov.Matern52(1, l)
        # Specify the GP; the default mean function is zero as none is specified
        gp = pm.gp.Marginal(cov_func=cov)
        # Prior over the observation noise
        s = pm.HalfCauchy("s", beta=5)
        # The marginal likelihood method fits the model
        y_ = gp.marginal_likelihood("y", X=x, y=y, noise=s)
        # Find the parameters that best fit the data
        # mp = pm.find_MAP()
        trace = pm.sample()
        # .conditional distribution for predictions given the X_new values
        f_pred = gp.conditional("f_pred", X_new)
        # Predict the distribution of samples at the new x values
        pred_samples = pm.sample_posterior_predictive(
            trace, var_names=['f_pred'], samples=2000)
    meanpred = np.mean(pred_samples['f_pred'], axis=0)
    # RMSE against the mean of the actual data
    RMSE = np.sqrt(np.mean((meanpred - mean_actual) ** 2))
    n2.append(RMSE)
    n1.append(length)
plt.plot(n1,n2,'-o')
plt.xlabel('Number of actual data points')
plt.ylabel('RMSE')