Gaussian process using noisy data training points

I’m currently implementing a gaussian process on a set of data points, and then will be removing data points to see how well the process infers the data.

When using the full data set it seems to function well with the noise function I have inputted producing the correct gaussian plot

However when I then try to implement the same code but using every other data plot the gaussian plot seems to take the training data points as having no uncertainty.

I use a similar method to the marginal_likelihood method shown in the page linked below.

https://docs.pymc.io/en/v3/pymc-examples/examples/gaussian_processes/GP-Marginal.html

If anyone had any insight into why this might be and how to fix it I would appreciate it greatly. Additionally although I do not currently use the data I do have the actual uncertainty of each data point if that could be used (but as the full data set is working without this I assume not)

Welcome!

Can you provide an example plot? And perhaps the code you are using to remove observations from you data set? I suspect that will help to diagnose what’s going on.

Of course, I have included my code below and a screenshot of the plot produced

I do believe it is something related to the best noise parameter being considerably smaller, meaning it is easier to pass through the data points then to smooth out and assume there is noise. However I also think there is an additional coding issue as when trying to add in the additional data uncertainty this code behaves very differently to the full data set.

data = pandas.read_csv('exported_stress_fsw_311.csv')
data = data[data['load']==0]
data = data[data['y']==0]
data = data[::2]
x = data['x'].to_numpy()
x = x[:,None]
y = data['s11'].to_numpy()

X_new = numpy.linspace(-50, 50, 100)[:, None]

with pm.Model() as model:
    #shape
    l = pm.Gamma("l", alpha=2, beta=1)
    #Noise
    n = pm.HalfCauchy("n",beta=5)
    #Covariance
    cov = n ** 2 * pm.gp.cov.Matern52(1,l)
    #Specify the GP, the default mean function is zero in this case as not specified
    gp = pm.gp.Marginal(cov_func=cov)
    #Placing priors over the function
    s = pm.HalfCauchy("s",beta=5)
    #Marginal likelihood method fits the model
    y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s)
    #Find the parameters that best fit the data
    mp = pm.find_MAP()
    #.conditional distribition for predictions given the X_new values 
    f_pred = gp.conditional("f_pred",X_new)
    #Predict the distribution of samples on the new x values
    pred_samples = pm.sample_posterior_predictive([mp], var_names= ['f_pred'],samples=2000)

As a comparison this is the plot produced when I use the entire data set

As I said previously I also have an additional error function from my data set which I can add to the noise function, which seems to work for the full data set (just makes the uncertainty larger) but for the partial data set it cause the plot to completely mess up. Thank you!

1 Like

Ok. And can we see what the estimated parameter values are (i.e., mp) in the 2 cases? That might help explain the behavior you are seeing.

I am definitely no GP expert (that would be @bwengals ), but I might recommend avoiding pm.find_MAP() and instead using trace = pm.sample(). find_MAP() is not recommended in other (non GP) settings and is used, for example, in this GP notebook.

In the full data set these are the estimated parameter values


And for the partial these are the estimated parameter values

And thank you for the advice about pm.find_MAP() I’ll have a go at doing it with the trace instead

So yeah, as you suspected (and is clear in the plot), the tiny value of s seems to be the culprit. I would try sampling and see what that gets you.

2 Likes

Thank you! I’ll give that a try. Also just as a sidenote I’m looking to represent this data in a more readable way for a report. I would like to get the sum of the mean points that the posterior sample has produced and then plot the difference between this and my actual data against number of samples taken. I imagine it will converge to a value of error between the two. Would you have any input on how to obtain the sum of the mean points? Sorry a bit of a beginner on python

1 Like

The mean posterior prediction at each point in X_new should be:

np.mean(pred_samples['f_pred'], axis=0)

Then you should be able to sum them. So all in one step:

np.sum(np.mean(pred_samples['f_pred'], axis=0))

Is that what you meant?

1 Like

That sounds good! Also thank you for the other advice using pm.sample() instead of the map function as it has seemed to fix it!

2 Likes

Great! I guess I should mention that collapsing the posterior predictions into their means ignores the uncertainty reflected in your posterior. Also, I will mention that you can instead sum each prediction and then take the mean and that both the result and the interpretation will differ. If the sum of the \hat{y} is the outcome of interest, then you might more interested in summing each trajectory:

np.sum(pred_samples['f_pred'], axis=1)

Then you can have a mean and standard deviation of the predicted sums:

np.mean(np.sum(pred_samples['f_pred'], axis=1))
np.std(np.sum(pred_samples['f_pred'], axis=1))

Or you can grab percentiles or whatever you might be interested in. But at that point you will at least have some reflection of the uncertainty.

1 Like

Also for the above code is there a way to only use samples which fall within a certain credible interval, say a 95% credible interval?

and unrelated but if I want to automate this code to run for different number of samples each time is there a way to put this into a for loop. Then record the data for each loop somewhere?

If I am understanding you correctly, you might like the np.percentile function. To get the 95% credible interval, I often do:

lower = np.percentile(pred_samples['f_pred'], 2.5, axis=0)
upper = np.percentile(pred_samples['f_pred'], 97.5, axis=0)

Like in @cluhmann 's earlier post, the axis keyword will compute the percentile across the samples for each data point, so you should get N lower and upper values, representing N 95% credible intervals. There shouldn’t be a need for a for loop, this should work regardless of the number of data points.

2 Likes