Parameter Uncertainty Quantification using Gaussian Processes (Latent vs Marginal)

Hello,

I’m working with my own biological model that is slow to run and has a relatively high number of input parameters (~8-15, depending on the approach), plus one input variable (calcium concentration), to predict one output variable (force). Ultimately, I’m looking to carry out some kind of uncertainty quantification as I fit my parameters to experimental measurements, but because the model is slow, I am hoping to use a Gaussian process surrogate model to help carry out this analysis.

I believe that I should be able to use a marginal GP rather than a latent GP in this case, because I can assume a normal likelihood and the GP is not embedded in any larger model.
I’ve created a simplified 2D surrogate model below that has one input parameter (nH) and one input variable (pCa). It seems to be behaving as expected, but I wanted to check with experts whether using a marginal GP in this surrogate-fitting approach seems reasonable.

import pymc as pm

with pm.Model() as gp_model_2D:
    # Priors for kernel hyperparameters
    # Lengthscales: how quickly the function varies along each input
    ell = [pm.Gamma('ell0', alpha=2, beta=1),
           pm.Gamma('ell1', alpha=2, beta=1)]

    # Amplitude: typical scale of function values
    eta = pm.HalfNormal('eta', sigma=3)

    # Covariance function
    cov_func = eta**2 * pm.gp.cov.ExpQuad(input_dim=2, ls=ell)

    # Constant mean function
    mf = pm.gp.mean.Constant(0.5)

    # GP prior with constant mean
    gp = pm.gp.Marginal(mean_func=mf, cov_func=cov_func)

    # Noise standard deviation prior
    sigma = pm.HalfNormal('sigma', sigma=0.01)

    # Marginal likelihood over the training data (X_data, y_data)
    y_obs = gp.marginal_likelihood('y_obs', X=X_data, y=y_data, sigma=sigma)

    # Sample from the posterior over the hyperparameters
    trace_gp = pm.sample(draws=1000, tune=1000, random_seed=43,
                         return_inferencedata=True, nuts_sampler='nutpie',
                         cores=16)

Of course, the next step will be another instance of the model where I fit the surrogate’s predictions to the experimentally observed values and estimate parameter values like nH.
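
For reference, here is a minimal sketch of how predictions at new inputs could look with the fitted marginal GP above. X_new is a hypothetical array of new (nH, pCa) points, not something defined in the original model:

# Minimal sketch: predict at hypothetical new inputs X_new, shape (n, 2)
with gp_model_2D:
    # Noise-free latent function values at the new inputs
    f_pred = gp.conditional('f_pred', Xnew=X_new)

    # Draw posterior predictive samples of the surrogate at X_new
    ppc = pm.sample_posterior_predictive(
        trace_gp, var_names=['f_pred'], random_seed=43
    )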

Thanks so much for any feedback or suggestions. I’m also comparing this approach to using a neural network surrogate model, and would be curious if anyone has thoughts on choosing between the two.

If you have normal likelihoods, then marginalizing a GP prior is the way to go when you can do it. That’s true for just about any marginalization in any model, hierarchical or otherwise, because of the Rao-Blackwell theorem (which is admittedly hard to parse into the form you need if you’re not familiar with it).

Things get a bit more complicated with Hamiltonian Monte Carlo when the unmarginalized density has a much easier geometry to navigate. That’s because HMC is much more sensitive to geometry (high and varying curvature, particularly) than it is to dimensionality.

So, as usual in stochastic processes, your mileage may vary; you might want to try both approaches.
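
For comparison, the unmarginalized (latent) version of the model in the original post would look roughly like this sketch, where the function values f are sampled explicitly (same X_data and y_data as above; just to show the shape of the model, not a tested recommendation):

with pm.Model() as latent_gp_model:
    ell = [pm.Gamma('ell0', alpha=2, beta=1),
           pm.Gamma('ell1', alpha=2, beta=1)]
    eta = pm.HalfNormal('eta', sigma=3)
    cov_func = eta**2 * pm.gp.cov.ExpQuad(input_dim=2, ls=ell)

    # Latent GP: the function values f are explicit parameters
    gp = pm.gp.Latent(cov_func=cov_func)
    f = gp.prior('f', X=X_data)

    # Normal likelihood around the latent function values
    sigma = pm.HalfNormal('sigma', sigma=0.01)
    y_obs = pm.Normal('y_obs', mu=f, sigma=sigma, observed=y_data)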


@zaxtax wrote a nice note about Rao-Blackwell and the benefits of marginalization in the automatic marginalization example.


Seconding the above, and I would add that paying close attention to your lengthscale priors will take you a long way. Something like an InverseGamma is a good default, because it has a heavy right tail and penalizes very small lengthscales. Also, Matern52 will often work much better as a default kernel: ExpQuad puts a very strong prior on smoothness, though this matters less as input_dim increases. Both of these choices affect the posterior geometry, and so can speed up sampling quite a bit, to tie into what @bob-carpenter said.
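
Concretely, in the model above that would be a drop-in change along these lines (the alpha/beta values are illustrative, not tuned recommendations):

# Drop-in replacement for the kernel priors in gp_model_2D above;
# the alpha/beta values are illustrative assumptions, not recommendations
ell = [pm.InverseGamma('ell0', alpha=5, beta=5),
       pm.InverseGamma('ell1', alpha=5, beta=5)]
eta = pm.HalfNormal('eta', sigma=3)

# Matern 5/2 is a less aggressively smooth default than ExpQuad
cov_func = eta**2 * pm.gp.cov.Matern52(input_dim=2, ls=ell)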


You have to be careful with the tails here when you don’t have a lot of data informing the length scales. The really diffuse InverseGamma(epsilon, epsilon) priors can be problematic, shifting the posterior out into the tails by placing the bulk of the probability mass there. Andrew Gelman wrote a paper that’s largely about hierarchical priors for length scales. Going to a hierarchical scale model gives you what is known in GP-land as “automatic relevance determination” (ARD), following Radford Neal’s thesis. Andrew also wrote a paper with Vince Dorie (a grad student of his) on zero-avoiding priors (aka penalties) for penalized maximum likelihood estimates of hierarchical models (which would otherwise require max marginal likelihood estimates).
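
As a rough sketch of what a hierarchical lengthscale prior could look like in PyMC terms (the hyperprior choices here are my own assumptions for illustration, not recommendations from the papers above):

# Illustrative sketch of a hierarchical (ARD-style) lengthscale prior;
# all hyperprior choices are assumptions, not tuned recommendations
with pm.Model() as ard_sketch:
    # Shared scale tying the per-dimension lengthscales together
    ell_mu = pm.HalfNormal('ell_mu', sigma=2)
    ell = pm.InverseGamma('ell', alpha=5, beta=5 * ell_mu, shape=2)

    eta = pm.HalfNormal('eta', sigma=3)
    cov_func = eta**2 * pm.gp.cov.Matern52(input_dim=2, ls=ell)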


I thought that maybe in addition to the answers about the details of the GPs, it might be useful to also address the (possibly?) underlying question about surrogate models.

I’m not sure I fully understand what precisely you are trying to achieve with the GP; from what you wrote, I assume it is something like the following?

You have some external, deterministic model that predicts force based on pCa and a set of parameters, say a function \text{expected\_force}(p_{\text{Ca}}; \theta_1,\dots,\theta_n). You also have measurements \text{force}^{(i)} for corresponding values p_{\text{Ca}}^{(i)}, and you would like to get posterior samples for the parameters \theta_1,\dots,\theta_n. But evaluating the \text{expected\_force} function is slow, so you have trouble just sampling the posterior directly.

The idea is to fit a neural network or a GP to approximate the function \text{expected\_force}? So this surrogate model would take n + 1 inputs (the n values of \theta and the calcium concentration) and return one output (the expected force). To train it, you would take some parameter and p_{\text{Ca}} values, evaluate your expensive function, and fit the GP or neural network on this data to find an approximation \text{expected\_force}^*. You can then plug this faster function into the actual model to sample the posterior for the parameters.

Is that the general idea, or did I misunderstand something? Assuming this is correct, here are a couple of thoughts, but take them with a grain of salt…

  • My first instinct for a surrogate model would probably be a neural network, not a GP. GPs can work in ~20-dimensional spaces, but they depend a lot on a consistent and meaningful sense of distance in the space where they are defined. So you would want roughly equal changes in the output of the function if you change any of the parameters by the same amount, or the lengthscale doesn’t really make much sense. I think that is rarely the case unless you take great care to rescale appropriately. (There are also a couple of other options, which I mention below.)
  • I don’t necessarily see a good reason to fit this surrogate model with PyMC. Do you really need the uncertainty of the surrogate model? I can think of uses for this, but is it worth the cost? And if you fit a GP with PyMC, it is not completely trivial (or cheap) to compute expected values at new locations, which you need in order to do the uncertainty quantification for the parameters. Using a library that is made for surrogate modeling, or just hand-writing something with jax/equinox, might be easier. You can still integrate the resulting function into a PyMC model later with pytensor.wrap_jax or pytensor.wrap_py. There are also more options, like XGBoost, random forests, polynomial chaos expansion, etc…
  • A big problem in surrogate modeling is knowing where in the input space to train the surrogate model. Since you typically don’t really know where the posterior is (why else would you want to run this model?), how do you choose the training input data? A space-filling design is one common starting point; see the sketch after this list. If you are unlucky, you might end up in a loop where you have some training data, fit the model with it, only to repeatedly discover that the surrogate model isn’t very good where the sampler ended up. I don’t think it is obvious that first training a surrogate model and then sampling necessarily needs fewer full-model evaluations than just sampling normally in the first place. Not that it can’t be better, but it doesn’t have to be. It is usually easier to parallelize compared to sampling, though.
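
For the “where to train” question, here is a minimal sketch of a space-filling design using scipy’s Latin hypercube sampler. The parameter bounds and the expensive_model name are placeholders, not part of the original model:

# Minimal sketch: Latin hypercube training design over assumed bounds
import numpy as np
from scipy.stats import qmc

# Hypothetical bounds for [nH, pCa]; placeholders only
lower = np.array([1.0, 4.0])
upper = np.array([8.0, 7.0])

sampler = qmc.LatinHypercube(d=2, seed=43)
X_train = qmc.scale(sampler.random(n=200), lower, upper)

# y_train = expensive_model(X_train)  # evaluate the slow model here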

Sampling with surrogate models is cool, but I think it quickly turns into more of a research project about sampling and function approximation. I don’t think it is a well developed standard method just yet. Before I’d go there, I’d first make sure that just sampling like usual really doesn’t work. How slow is your model function? With that few parameters even gradient-free methods might still have a good chance of working, and SMC or some other population sampler might still work (and those can run in parallel quite well).