I have implemented a Gaussian process that uses my actual data points to generate predicted data points. This was done using the code below, and it is all functioning well.
import numpy as np
import pandas as pd
import pymc3 as pm  # assumed: PyMC3, since sample_posterior_predictive is called with samples=

data = pd.read_csv('exported_stress_fsw_311.csv')
data = data[data['load'] == 0]
data = data[data['y'] == 0]
data = data[::2]  # keep every second row
x = data['x'].to_numpy()[:, None]  # column vector, shape (n, 1)
y = data['s11'].to_numpy()
er = data['s11_std'].to_numpy()
X_new = np.linspace(-50, 50, 100)[:, None]

with pm.Model() as model:
    l = pm.Gamma("l", alpha=2, beta=1)
    n = pm.HalfCauchy("n", beta=5)
    cov = n ** 2 * pm.gp.cov.Matern52(1, l)
    # Specify the GP; the mean function defaults to zero when not specified
    gp = pm.gp.Marginal(cov_func=cov)
    # Prior on the additional noise term
    s = pm.HalfCauchy("s", beta=5)
    # Marginal likelihood of the observed data
    y_ = gp.marginal_likelihood("y", X=x, y=y, noise=s + er)
    # Find the parameters that best fit the data (MAP alternative, not used)
    # mp = pm.find_MAP()
    trace = pm.sample()
    # Conditional distribution for predictions given the X_new values
    f_pred = gp.conditional("f_pred", X_new)
    # Draw posterior predictive samples at the new x values
    pred_samples = pm.sample_posterior_predictive(trace, var_names=['f_pred'], samples=2000)
However, my research involves removing actual data points and seeing how the predicted points start to differ, and I am looking for an effective way to represent this. At the moment I compute (mean predicted data - mean actual data)^2, and I am also looking into dividing that by the variance. I have been using the code below to do this:
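# pred_samples['f_pred'] has shape (n_samples, len(X_new)):
# axis=1 averages over the X_new points within each draw; axis=0 would give the mean at each X_new point
meanpreds = np.mean(pred_samples['f_pred'], axis=1)
# meandata (the mean of the actual data) is not defined in this snippet
diff = meanpreds - meandata
sqdiff = np.square(diff)
sumdiff = np.sum(sqdiff)
sumdiff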
I then plot the sum of the squared differences against the number of data points.
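To make the variance-scaled version of this measure concrete, here is a minimal sketch using only NumPy. It assumes the posterior samples and the reference values are evaluated at the same x locations, and it takes the mean and variance across draws (axis 0). The function name `squared_error_summary` and the synthetic arrays are made up for illustration and are not part of my actual analysis.

```python
import numpy as np

def squared_error_summary(samples, reference):
    """Summarise how far the posterior mean is from a reference curve.

    samples:   array of shape (n_draws, n_points), posterior draws
    reference: array of shape (n_points,), values to compare against
    """
    mean_pred = samples.mean(axis=0)        # posterior mean at each point
    var_pred = samples.var(axis=0, ddof=1)  # posterior variance at each point
    sq_diff = (mean_pred - reference) ** 2
    return {
        "sse": float(sq_diff.sum()),                      # plain sum of squared differences
        "scaled_sse": float((sq_diff / var_pred).sum()),  # each term divided by the variance
    }

# Synthetic stand-ins (not the real stress data)
rng = np.random.default_rng(0)
reference = np.sin(np.linspace(-3, 3, 100))
samples = reference + rng.normal(0.0, 0.1, size=(2000, 100))
summary = squared_error_summary(samples, reference)
```

Dividing by the variance means a large difference counts for less where the prediction is already very uncertain, which may or may not be the behaviour wanted here.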
My questions are:
- Are there any other known ways that may be better for representing the Gaussian data? I used this one because some of my values are negative, so I needed to square them.
- I would also like to include some kind of noise/uncertainty information in this plot, but I am not entirely sure how to do this, so any suggestions would be appreciated.
- Lastly, I currently do every run separately when changing the number of data points. I'm trying to optimise this with a for loop that runs through different values for this line of code:
data = data[::2]
but I am struggling to find a way to store the data from each iteration of the loop, so any advice on that would be helpful.
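The kind of loop I have in mind could be sketched roughly as follows. This is only an outline under some assumptions: `fit_and_predict` is a hypothetical placeholder for the model fit and posterior predictive step (here it just interpolates and adds noise so the example runs without PyMC), the data are synthetic, and the reference curve is taken to be the prediction using every data point. Results are stored in a dictionary keyed by the number of points kept, and computing the sum of squared differences once per posterior draw gives a spread that could be used for error bars.

```python
import numpy as np

def fit_and_predict(x, y, x_new, n_draws=200, seed=0):
    """Hypothetical placeholder for the GP fit and posterior predictive sampling.

    Returns an array of shape (n_draws, len(x_new)).
    """
    rng = np.random.default_rng(seed)
    base = np.interp(x_new, x, y)
    return base + rng.normal(0.0, 0.05, size=(n_draws, len(x_new)))

# Synthetic stand-in for the full data set
x_full = np.linspace(-50, 50, 41)
y_full = np.sin(x_full / 10.0)
x_new = np.linspace(-50, 50, 100)

# Reference curve: mean prediction when every data point is used
reference = fit_and_predict(x_full, y_full, x_new).mean(axis=0)

results = {}  # keyed by the number of data points kept
for step in (1, 2, 4, 8):
    # Always slice the untouched full arrays, so the thinning does not compound
    x_sub, y_sub = x_full[::step], y_full[::step]
    samples = fit_and_predict(x_sub, y_sub, x_new)
    # One sum of squared differences per posterior draw
    sse_per_draw = ((samples - reference) ** 2).sum(axis=1)
    results[len(x_sub)] = {
        "step": step,
        "sse_mean": float(sse_per_draw.mean()),
        "sse_lo": float(np.percentile(sse_per_draw, 2.5)),
        "sse_hi": float(np.percentile(sse_per_draw, 97.5)),
    }
```

Note that reassigning `data = data[::2]` inside a loop would thin the already-thinned data on every pass, which is why the sketch slices from the full arrays each time. The per-draw values also include the spread of the draws themselves, so their mean is larger than the squared difference of the mean prediction.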