Gaussian process using noisy data training points

lucie-jackson · February 10, 2022, 2:02pm

I’m currently fitting a Gaussian process to a set of data points, and will then remove data points to see how well the process infers the missing data.

When using the full data set it seems to function well: with the noise function I have supplied, it produces the correct Gaussian process plot.

However, when I run the same code using only every other data point, the fitted process seems to treat the training data points as having no uncertainty.

I use a similar method to the marginal_likelihood method shown in the page linked below.

https://docs.pymc.io/en/v3/pymc-examples/examples/gaussian_processes/GP-Marginal.html

If anyone has any insight into why this might be and how to fix it, I would appreciate it greatly. Additionally, although I do not currently use it, I do have the actual uncertainty of each data point, if that could be used (but as the full data set works without it, I assume it is not needed).

cluhmann · February 10, 2022, 4:30pm

Welcome!

Can you provide an example plot? And perhaps the code you are using to remove observations from your data set? I suspect that will help to diagnose what’s going on.

lucie-jackson · February 11, 2022, 10:16am

Of course, I have included my code below and a screenshot of the plot produced

I believe it is related to the best-fit noise parameter being considerably smaller, meaning it is easier for the fit to pass through the data points than to smooth them out and assume there is noise. However, I also think there is an additional coding issue, because when I try to add in the additional data uncertainty this code behaves very differently than it does with the full data set.
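To illustrate the suspicion above, here is a minimal NumPy sketch (not the PyMC model from this thread) of how the noise standard deviation controls whether a GP posterior mean passes through the training points. The data, the fixed length scale and amplitude, and the helper names `matern52` and `gp_mean` are all assumptions made up for the illustration.

```python
import numpy as np

def matern52(xa, xb, lengthscale, amplitude):
    """Matern 5/2 kernel between two 1-D arrays of inputs."""
    r = np.abs(xa[:, None] - xb[None, :]) / lengthscale
    s = np.sqrt(5.0) * r
    return amplitude**2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_mean(x_train, y_train, x_new, noise_sd, lengthscale=10.0, amplitude=1.0):
    """Posterior mean of a zero-mean GP with Gaussian noise of std noise_sd."""
    K = matern52(x_train, x_train, lengthscale, amplitude)
    K_noisy = K + noise_sd**2 * np.eye(len(x_train))
    K_star = matern52(x_new, x_train, lengthscale, amplitude)
    return K_star @ np.linalg.solve(K_noisy, y_train)

# Synthetic noisy observations (hypothetical, not the stress data from the post).
rng = np.random.default_rng(0)
x_train = np.linspace(-50, 50, 15)
y_train = np.sin(x_train / 15.0) + rng.normal(0.0, 0.3, size=x_train.size)

# Evaluate the posterior mean at the training inputs themselves.
fit_tiny_noise = gp_mean(x_train, y_train, x_train, noise_sd=1e-3)
fit_larger_noise = gp_mean(x_train, y_train, x_train, noise_sd=0.3)

# Largest gap between the fitted mean and the observed values.
resid_tiny = np.max(np.abs(fit_tiny_noise - y_train))
resid_larger = np.max(np.abs(fit_larger_noise - y_train))
print(resid_tiny, resid_larger)
```

With a near-zero noise term the mean essentially interpolates the observations, while a larger noise term lets it smooth past them. That matches the behaviour described here if the optimizer settles on a very small noise value.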

import numpy
import pandas
import pymc3 as pm  # assumed: the linked docs are for PyMC3 (v3)

# Load the data and keep the zero-load measurements along y = 0
data = pandas.read_csv('exported_stress_fsw_311.csv')
data = data[data['load']==0]
data = data[data['y']==0]
# Keep every other data point
data = data[::2]
x = data['x'].to_numpy()
x = x[:,None]  # column vector, shape (n, 1)
y = data['s11'].to_numpy()

# Points at which to predict
X_new = numpy.linspace(-50, 50, 100)[:, None]

with pm.Model() as model:
    # Length scale of the kernel
    l = pm.Gamma("l", alpha=2, beta=1)
    # Amplitude (signal standard deviation) of the kernel
    n = pm.HalfCauchy("n",beta=5)
    # Covariance function
    cov = n ** 2 * pm.gp.cov.Matern52(1,l)
    # Specify the GP; the mean function defaults to zero as none is given
    gp = pm.gp.Marginal(cov_func=cov)
    # Prior on the observation noise standard deviation
    s = pm.HalfCauchy("s",beta=5)
    # The marginal likelihood ties the GP to the observed data
    y_ = gp.marginal_likelihood("y",X=x,y=y, noise=s)
    # Find the parameters that best fit the data (MAP point estimate)
    mp = pm.find_MAP()
    # Conditional distribution for predictions at the X_new values
    f_pred = gp.conditional("f_pred",X_new)
    # Draw samples of the function at the new x values
    pred_samples = pm.sample_posterior_predictive([mp], var_names= ['f_pred'],samples=2000)

As a comparison this is the plot produced when I use the entire data set

As I said previously, I also have an additional error function from my data set which I can add to the noise function. This seems to work for the full data set (it just makes the uncertainty larger), but for the partial data set it causes the plot to completely mess up. Thank you!
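As a point of reference for the per-point error idea, here is a minimal NumPy sketch of heteroscedastic noise in plain GP regression: each point's measurement variance is added to its own diagonal entry of the kernel matrix. This is the textbook calculation, not the PyMC API, and how per-point values should be passed to `gp.marginal_likelihood` is best checked against the PyMC documentation. The data, the `y_err` values and the `matern52` helper are hypothetical.

```python
import numpy as np

def matern52(xa, xb, lengthscale=10.0, amplitude=1.0):
    """Matern 5/2 kernel between two 1-D arrays of inputs."""
    r = np.abs(xa[:, None] - xb[None, :]) / lengthscale
    s = np.sqrt(5.0) * r
    return amplitude**2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)

x_train = np.linspace(-50, 50, 12)
y_train = np.sin(x_train / 15.0)

# Hypothetical per-point standard errors: one point is measured badly.
y_err = np.full(x_train.size, 0.05)
y_err[5] = 1.0
y_train[5] += 1.0  # and that point really is far off the underlying curve

K = matern52(x_train, x_train)
# Heteroscedastic noise: each point contributes its own variance on the diagonal.
K_noisy = K + np.diag(y_err**2)

# Posterior mean at the training inputs.
post_mean = K @ np.linalg.solve(K_noisy, y_train)
residual = np.abs(post_mean - y_train)
print(residual.round(3))
```

The fitted mean stays close to the precisely measured points and is allowed to miss the point with the large stated error, which is the behaviour one would hope for when real measurement uncertainties are available.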

cluhmann · February 11, 2022, 2:28pm

Ok. And can we see what the estimated parameter values are (i.e., mp) in the 2 cases? That might help explain the behavior you are seeing.

I am definitely no GP expert (that would be @bwengals), but I might recommend avoiding pm.find_MAP() and instead using trace = pm.sample(). find_MAP() is not recommended in other (non-GP) settings and is used, for example, in this GP notebook.

lucie-jackson · February 11, 2022, 2:52pm

In the full data set these are the estimated parameter values

And for the partial these are the estimated parameter values

And thank you for the advice about pm.find_MAP() I’ll have a go at doing it with the trace instead

cluhmann · February 11, 2022, 2:54pm

So yeah, as you suspected (and is clear in the plot), the tiny value of s seems to be the culprit. I would try sampling and see what that gets you.

lucie-jackson · February 11, 2022, 4:08pm

Thank you! I’ll give that a try. Also, as a side note, I’m looking to represent this data in a more readable way for a report. I would like to get the sum of the mean points that the posterior samples have produced, and then plot the difference between this and my actual data against the number of samples taken. I imagine the error between the two will converge to some value. Would you have any input on how to obtain the sum of the mean points? Sorry, I’m a bit of a beginner with Python.

cluhmann · February 11, 2022, 7:03pm

The mean posterior prediction at each point in X_new should be:

np.mean(pred_samples['f_pred'], axis=0)

Then you should be able to sum them. So all in one step:

np.sum(np.mean(pred_samples['f_pred'], axis=0))
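For anyone following along without the model, the two steps above can be tried on a stand-in array. This is only a sketch: `f_pred` below is synthetic random data with the shape that `pred_samples['f_pred']` would have for 2000 draws at 100 `X_new` points, which is an assumption based on the code earlier in the thread.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for pred_samples['f_pred']: 2000 draws (rows) at 100 points (columns).
f_pred = rng.normal(loc=1.0, scale=0.5, size=(2000, 100))

# Average over draws (axis 0) to get one mean prediction per X_new point.
mean_curve = np.mean(f_pred, axis=0)

# Sum the per-point means into a single number.
sum_of_means = np.sum(mean_curve)

print(mean_curve.shape, sum_of_means)
```

The key detail is `axis=0`: it averages across draws and leaves one value per prediction point.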

Is that what you meant?

lucie-jackson · February 11, 2022, 7:21pm

That sounds good! Also, thank you for the other advice about using pm.sample() instead of the MAP function, as it seems to have fixed it!

cluhmann · February 11, 2022, 7:46pm

Great! I guess I should mention that collapsing the posterior predictions into their means ignores the uncertainty reflected in your posterior. Also, I will mention that you can instead sum each prediction and then take the mean. The mean itself comes out the same (the mean of the sums equals the sum of the means), but you keep the whole distribution of sums, so the interpretation differs. If the sum of the \hat{y} is the outcome of interest, then you might be more interested in summing each trajectory:

np.sum(pred_samples['f_pred'], axis=1)

Then you can have a mean and standard deviation of the predicted sums:

np.mean(np.sum(pred_samples['f_pred'], axis=1))
np.std(np.sum(pred_samples['f_pred'], axis=1))

Or you can grab percentiles or whatever you might be interested in. But at that point you will at least have some reflection of the uncertainty.
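Here is a small sketch of the per-trajectory approach on a synthetic stand-in for `pred_samples['f_pred']` (assumed shape: draws by prediction points). On an array like this, the mean of the per-draw sums matches the sum of the per-point means up to floating point, by linearity. What the per-draw sums add is the spread.

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for pred_samples['f_pred']: 2000 draws at 100 points.
f_pred = rng.normal(loc=1.0, scale=0.5, size=(2000, 100))

# One sum per posterior draw (trajectory).
sums = np.sum(f_pred, axis=1)

mean_of_sums = np.mean(sums)
sd_of_sums = np.std(sums)
# A 95% interval for the sum, taken directly from the draws.
lo, hi = np.percentile(sums, [2.5, 97.5])

print(mean_of_sums, sd_of_sums, (lo, hi))
```

Reporting `mean_of_sums` together with `sd_of_sums` or the interval `(lo, hi)` carries the posterior uncertainty through to the summary.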

lucie-jackson · February 13, 2022, 11:20am

Also, for the above code, is there a way to only use samples which fall within a certain credible interval, say a 95% credible interval?

And, unrelated, if I want to automate this code to run for a different number of samples each time, is there a way to put it into a for loop and then record the data from each iteration somewhere?

Martin_Ingram · February 13, 2022, 12:15pm

If I am understanding you correctly, you might like the np.percentile function. To get the 95% credible interval, I often do:

lower = np.percentile(pred_samples['f_pred'], 2.5, axis=0)
upper = np.percentile(pred_samples['f_pred'], 97.5, axis=0)

Like in @cluhmann's earlier post, the axis keyword will compute the percentile across the samples for each data point, so you should get N lower and upper values, representing N 95% credible intervals. There shouldn’t be a need for a for loop; this should work regardless of the number of data points.
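A rough sketch covering both questions, again on a synthetic stand-in for `pred_samples['f_pred']`. Two assumptions here: "samples within a 95% interval" is read as keeping the draws whose sum lies inside the central 95% of sums, and "different number of samples" is read as using the first n draws. The `results` list of dictionaries is just one possible way to record each run.

```python
import numpy as np

rng = np.random.default_rng(7)
# Stand-in for pred_samples['f_pred']: 2000 draws at 100 points.
f_pred = rng.normal(loc=0.0, scale=1.0, size=(2000, 100))

# Pointwise 95% band: one lower and one upper value per prediction point.
lower = np.percentile(f_pred, 2.5, axis=0)
upper = np.percentile(f_pred, 97.5, axis=0)

# Keep only the draws whose sum lies inside the central 95% of sums.
sums = f_pred.sum(axis=1)
lo, hi = np.percentile(sums, [2.5, 97.5])
inside = sums[(sums >= lo) & (sums <= hi)]

# Repeat the summary for different numbers of draws and record each result.
results = []
for n in (100, 500, 1000, 2000):
    subset = f_pred[:n]
    results.append({
        "n_samples": n,
        "sum_of_means": float(subset.mean(axis=0).sum()),
        "sd_of_sums": float(subset.sum(axis=1).std()),
    })

for row in results:
    print(row)
```

The list of dictionaries can be passed to `pandas.DataFrame(results)` if a table or a plot against `n_samples` is wanted for the report.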
