I have implemented a Gaussian process that uses my actual data points to generate predicted data points. The code below does this and is all working well:

    import numpy as np
    import pandas
    import pymc3 as pm  # the samples= argument used below is PyMC3 (v3) API

    data = pandas.read_csv('exported_stress_fsw_311.csv')
    data = data[data['load'] == 0]
    data = data[data['y'] == 0]
    data = data[::2]  # keep every second data point

    x = data['x'].to_numpy()
    x = x[:, None]
    y = data['s11'].to_numpy()
    er = data['s11_std'].to_numpy()

    X_new = np.linspace(-50, 50, 100)[:, None]

    with pm.Model() as model:
        l = pm.Gamma("l", alpha=2, beta=1)
        n = pm.HalfCauchy("n", beta=5)
        cov = n ** 2 * pm.gp.cov.Matern52(1, l)

        # Specify the GP; the mean function defaults to zero as none is given
        gp = pm.gp.Marginal(cov_func=cov)

        # Placing priors over the function
        s = pm.HalfCauchy("s", beta=5)

        # Marginal likelihood method fits the model
        y_ = gp.marginal_likelihood("y", X=x, y=y, noise=s + er)

        # Find the parameters that best fit the data
        # mp = pm.find_MAP()
        trace = pm.sample()

        # Conditional distribution for predictions given the X_new values
        f_pred = gp.conditional("f_pred", X_new)

        # Predict the distribution of samples on the new x values
        pred_samples = pm.sample_posterior_predictive(trace, var_names=['f_pred'], samples=2000)

However, my research involves removing actual data points and seeing how the predicted points start to differ as a result. I am looking for an effective way to represent this. At the moment I compute (mean predicted data - mean actual data)^2, and I am also looking into dividing that by the variance. I have been using the code below to do this:
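
    # pred_samples['f_pred'] has shape (n_samples, n_new_points), so axis=1
    # averages over the X_new points within each posterior sample
    meanpreds = np.mean(pred_samples['f_pred'], axis=1)
    # meandata (the mean of the actual data) is computed elsewhere, not shown
    diff = meanpreds - meandata
    sqdiff = np.square(diff)
    sumdiff = np.sum(sqdiff)
    sumdiff
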
I then plot the sum of the squared differences against the number of data points.
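
The variance-normalised version mentioned above could be sketched roughly as follows. This is only one reading of that idea, shown on synthetic arrays rather than the real stress data: `standardised_sq_error`, `samples`, `x_obs`, `y_obs`, and `y_std` are hypothetical names, and the sketch assumes the predictions have been evaluated at the measured x positions so that predictions and data line up point by point.

```python
import numpy as np

def standardised_sq_error(samples, y_obs, y_std):
    """Sum of squared differences between the posterior mean and the data,
    each divided by (predictive variance + measurement variance).

    samples: shape (n_samples, n_points), predictions at the observed x.
    """
    mean_pred = samples.mean(axis=0)        # posterior mean at each point
    var_pred = samples.var(axis=0, ddof=1)  # posterior variance at each point
    sq_diff = np.square(mean_pred - y_obs)
    return np.sum(sq_diff / (var_pred + np.square(y_std)))

# Synthetic stand-in data (assumption: not the real measurements)
rng = np.random.default_rng(0)
x_obs = np.linspace(-50, 50, 21)
y_true = 100 * np.exp(-(x_obs / 15) ** 2)
y_std = np.full_like(x_obs, 5.0)
y_obs = y_true + rng.normal(0, y_std)

# Fake "posterior samples": truth plus scatter, shape (2000, 21)
samples = y_true + rng.normal(0, 3.0, size=(2000, x_obs.size))

score = standardised_sq_error(samples, y_obs, y_std)
print(score)
```

Dividing by the combined variance makes the quantity dimensionless, so runs with different numbers of retained points can be compared on the same scale.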
My questions are:
- Are there other known, possibly better, ways to represent the difference for Gaussian data? I used squared differences because some of my values are negative.
- I would also like to include some kind of noise/uncertainty in this plot but am not entirely sure how to do this, so any suggestions would be appreciated.
- Lastly, I currently do every run separately when changing how many data points there are. I'm trying to optimise this with a for loop that runs through different step values in the line `data = data[::2]`, but I am struggling to find a way to store the results from each iteration of the loop, so any advice on that would be helpful.
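
One possible way to structure that loop is to wrap the fit in a function and collect each run in a dictionary keyed by the subsampling step. The sketch below is an illustration under stated assumptions, not a tested solution for the real model: `fit_and_predict` is a hypothetical placeholder (a cheap interpolation with random scatter, so the example runs without PyMC3) that would be replaced by the GP code above, and the data are synthetic. It also computes the squared-error sum once per posterior sample, which differs from squaring the difference of the means but gives a spread that can be drawn as error bars.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_and_predict(x, y, er, x_new, n_samples=500):
    """Hypothetical stand-in for the GP fit; returns (n_samples, len(x_new)).
    Replace the body with the pm.Model block and return pred_samples['f_pred']."""
    mean = np.interp(x_new, x, y)
    scatter = np.interp(x_new, x, er)
    return mean + rng.normal(0.0, 1.0, size=(n_samples, x_new.size)) * scatter

# Synthetic full dataset (assumption: stands in for the filtered CSV data)
x_full = np.linspace(-50, 50, 41)
er_full = np.full_like(x_full, 5.0)
y_full = 100 * np.exp(-(x_full / 15) ** 2) + rng.normal(0, er_full)

results = {}
for step in [1, 2, 4, 8]:
    # Same idea as data[::step], applied to each array
    x, y, er = x_full[::step], y_full[::step], er_full[::step]
    samples = fit_and_predict(x, y, er, x_new=x_full)
    # Sum of squared differences against the full data, one value per sample
    sse_per_sample = np.sum(np.square(samples - y_full), axis=1)
    results[step] = {
        "n_points": x.size,
        "sse_mean": sse_per_sample.mean(),
        "sse_std": sse_per_sample.std(ddof=1),
    }

for step, r in results.items():
    print(step, r["n_points"], round(r["sse_mean"], 1), round(r["sse_std"], 1))
```

The `sse_mean` and `sse_std` values could then be plotted against `n_points`, for example with matplotlib's `plt.errorbar`, to show both the trend and its uncertainty.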