Bayesian Analysis of Posterior Predictions from GP model

I am trying to output a posterior probability distribution of parameters using 10000 points extracted from the posterior of a GP that was trained on a small set of simulation data. The prior of the parameters given the outputs is assumed to be non-informative uniform for each parameter. There are 13 parameters and 13 output properties in the model. I am using a multivariate normal likelihood function that is derived as the following:
where Yobs is the experimental value for the property calculated in the simulation, muGPM is the mean GP prediction and sigma is the diagonal covariance matrix with each element given by the sum of the variance of the GP prediction and a noise variance where the noise standard deviation prior is assumed to be Half Normal.

Could someone take a look at my code and see if I am in the right direction? I am new to data science and to PyMC3 so any advice would be extremely helpful. I established the 13 parameter priors separately, but I will only include a couple because they are similar and I am not sure if this is the best way to guide this. I establish testvals based on optimized parameter outputs from the initial simulation. The chain lengths are at 10000 just to get a decent initial look at the parameter distributions, but 80000 is the approximated length for convergence.

with pm.Model() as basic_model:
  var1 = pm.Uniform('var1', lower=16.0, upper=18.0, testval=16.94)
  var12 = pm.Uniform('var12', lower=0.4, upper=0.8, testval=0.679)
  var13 = pm.Uniform('var13', lower=6.0, upper= 10.2, testval=8.16)
  sigNoise = pm.HalfNormal('sigNoise')
  cov_noise = pm.Deterministic("cov_noise",(cov + sigNoise**2) * np.eye(13))
  likelihood = pm.MvNormal('likelihood', mu=means, cov=cov_noise, observed=yobs)
  trace = pm.sample(10000, chains=2,cores=1)`

Hi Troy,
Thanks for participating! From a quick glance, I think I’d have two comments:

  1. I’d use more informative priors that uniform distributions, especially for GPs. This will help guarding against overfitting and just help the sampler overall – here are good recommendations for priors. For these reasons, uniforms are very rarely a good choice, and they are actually very informative at the boundaries – they give virtually infinite weight to the boundaries; see this for details.
  2. Any reason you’re not using PyMC3’s GP module? It is optimized for performance and will take care of a lot of technical details for you.

Hope this helps :vulcan_salute:

The main reason that the GP is used is to create the 10000 data points from a small set of evenly spread out simulation data as apposed to getting the 10000 data points from just simulations, which is very computationally expensive. Each property that is calculated has its own optimized kernel through Grid Search methods. I looked into the GP module, but I wasn’t sure if it would be running another GP on the data outputted from GP model that was already created, and it seemed that I would have to specify a kernel for the method even though multiple kernels were used to create the posterior data set. I would like to calculate the probability distribution of each parameter as if the data set from the GP were simulation data. I am trying to reduce the amount of external bias on the parameter set’s Bayesian analysis, so I am using a uniform prior. I am not sure if this clarified my situation, but I hope it did.

Also, when I got outputs from the code I posted earlier, it seemed that the parameters only extracted from the prior. I’ve been having a hard time figuring out exactly how to incorporate the parameter priors into the actual code where the priors are p(theta) in Bayes Theorem: image
I saw that in the blackbox example uniform priors were put into the likelihood function using theano.tensor.as_tensor_variable and having observed=(parameters) in the likelihood function. The likelihood function was custom so the example uses DensityDist. Is there a way I can incorporate the parameters like this example through MvNormal?
Thanks for the help so far, I didn’t realize that putting bounds on the priors would make them informative even though that seems pretty obvious now.

Thanks for the clarification but I’m not sure I understand what you’re trying to do: are you trying to fit a GP model with MCMC?
But it sounds like you already have posterior samples. So, are you trying to do posterior retrodictive checks (i.e check whether the data generated by your model are in the realm of the possible compared to your observed data)?
In the latter case, I don’t think specifying a whole PyMC model just to do this (and not also MCMC) is worth it. But tell me if I’m mistaken, I’m not sure I understood your goal here

My goal is to quantify the uncertainty of the parameters used in the (molecular dynamic) simulation itself. So, check whether the data by outputs of the simulation are in the possible compared to observed data (experimental data). If we can obtain probability distributions of the parameters, we can effectively define the robustness of the parameters in the model output from the simulation. Small perturbations in the parameters can result in large changes in the outputs. The main theory has been published in a couple papers, so I am less worried on the theory and more on the implementation. I have used other methods to check if the GP represents the simulation data.
Right now my main concern is incorporating both the observed (experimental) data and the parameters into MvNormal where x is observed data in image
I think that the parameters would be pi in the equation, but I am not totally certain.
Let me know if this clears things up or not.

Ok, thanks! So it sounds to me that you need both MCMC to estimate the uncertainty in your parameters’ posterior, as well as posterior predictive sampling to check the generative accuracy of your model.

pi and T would be parameters here, and thus would need priors.

I don’t see a big problem here: you’re using the observed kwarg as expected. Note though that it seems you’re not inferring means, cov and var*. Again, I would also use much more informative priors than Uniforms.
In case that helps, here are examples of GP regression using PyMC3 (start at Code 14.30) – this is a port of Statistical Rethinking 2, chapter 14

1 Like

Yes, exactly! Although the method may seem weird, it is the beginning of my journey down Bayesian analytics so I must use preexisting published methods. Surrogate modeling is employed to generate the large data necessary to use non-informative Uniform priors, since appropriate priors are unknown at this point in my research. I look to use more informative priors in the future when surrogate modeling or large data is not accessible.

Currently, I am importing the data generated from the surrogate model as mu and cov is the covariance function of the outputs from the surrogate model. I turn var* into a single variable theta using theano.tensor.as_tensor_variable(). In the derived likelihood function, it is known that mu is a function of theta, but mu is the posterior prediction dataset from the GP. Is there a way that I can infer the dependence of theta on the existing dataset, mu? If not, do you have any initial ideas to infer theta into the likelihood function in a meaningful way?

1 Like

Also, I will link the paper that I am basing my statements. Sections 2.3 and 3.3 are the sections of interest.

I think you can just do theta = pm.Uniform("theta", np.array([16.0, 0.4, 6.0]), np.array([18.0, 0.8, 10.2]), shape=3)

Ok sweet, I’ll use that instead. Also, this weekend I found out an actual question to what I have been trying to ask in the past. Currently, I use GPy to create the surrogate model of the initial data and then wanted to guide Bayesian analysis on that data.I understand now that I need my mu to be a function that incorporates the GP model from GPy. I do this using gp.predict function from GPy to represent the function for mu, but this function does not take symbolic variables. I found this example that talks about the same issue, but I cannot get it to work in the slightest. Do you have any ideas on creating a Class to transform the symbolic variables into something like a numpy array or something deterministic so it can work with the GPy gp.predict function?

I think @junpenglao is better positioned than me to give advice here. Also note that this is quite an advanced use for a first PyMC model

I might be missing a bit of context here but seems you are trying to do Bayesian Optimization? I am confused about why you will need a surrogate model, and why you need the output from gp.predict as the function of mu - I would imagine that these are estimated using the full data set.

I didn’t realize it but yes I am doing Bayesian Optimization. I ditched the idea of working with the GPy model and built this code instead, does it seem right? I was getting a mass matrix contains zeros on the diagonal error when I was running with a small test data set.

with pm.Model() as model:
    theta = pm.Uniform('theta', lower=np.array([#lowest input values of data set]), upper=np.array([#highest input values of data set]), shape=13)
    m =
    cov_func =, ls=theta)
    gp =, cov_func=cov_func)
    sig_noise = pm.HalfNormal('sig_noise')
    y_ = gp.marginal_likelihood("y_", X=features, y=labels, noise=sig_noise, is_observed=False, observed=yobs)

Also, when I increase the data set to the actual data set, the time is extremely large. Is this normal?

This seems like a good first step to me. I don’t have a lot of experience with GPs yet, but here are some comments:

  • I don’t think you should use observed: yobs should probably be passed to y, since this kwarg takes the outcomes you observed and are evaluating the GP on.
  • In any case, I’m pretty sure you can’t use both is_observed and observed at the same time, since the former kwarg is used to set y as an (un)observed variable in the model.
  • I’m not surprised by the mass matrix error: I’m pretty sure your prior on theta is not unformative enough, which is giving the sampler a hard time because the geometry it’s seeing is pretty flat. I’m guessing this problem is even more pronounced when you increase the data set – GPs are O(n3) complexity – hence the big sampling time. I’d advise doing prior predictive checks, so you can see the consequences of your choice of theta priors on the GP functions.
    Gamma distributions are usually a good choice for priors on lengthscales, as they are very flexible, so you can adjust the amount of information. For more details, you can read the GP chapter in Osvaldo Martin’s book, as well as the Mauna Loa example on PyMC docs. Finally, for more information about the pathological behavior of GPs, you can read this series of articles.

Hope this helps :vulcan_salute:


You probably also want theta to just be a single number, so don’t set shape.

1 Like

Ah true, didn’t see that :+1:

If I remove shape how would I get the posterior distributions for all 13 parameters? Should I just declare them all separately?

theta is the lengthscale of the GP’s kernel, so I think it’s one-dimensional