Multidimensional Gaussian process

@Actionhank

  • That’s right. I have distributions on each of my model parameters and wanted to be able to apply a different distribution to each one if needed.
  • givens should contain the X and y values that your gp was fit to while params_ are the new X values (X* in the documentation) to generate your new y values (y*).

I’m not exactly sure I understand your use case but what I am doing is something like:

#Define my random variables
#Stack the random variables into a single (1, number_of_rvs) input for the GP
params_ = theano.tensor.stack([my, random, variables, ...]).reshape((1, number_of_rvs))
#Build surrogate...
y_ = pm.SomeDistribution('y_', mu=surrogate, observed=observed_y, **other_kwargs)

Does that make sense?

Ok, that is basically my implementation at the moment. My problem is always that I get an input size mismatch error. E.g. my parameter is 1D and I have ‘n=100’ datapoints in ‘y_observed’:

params_ = pm.HalfNormal('param1', sd=1, shape=(1,1))
f_surrogate = pm.Deterministic('f_surrogate', gp._build_conditional(params_, *givens)[0])
y_ = pm.Normal('y_', mu=f_surrogate, sd=0.1, observed=y_observed)

I get the error:

Input dimension mis-match. (input[0].shape[0] = 100, input[1].shape[0] = 1)

What line is the error happening on? If it is on

y_ = pm.Normal(...

maybe you need to flatten/reshape y_observed?

OK,

y.flatten()

did the job. At least the sampling now runs, although without convergence (as expected :wink: ). Hopefully my implementation is correct. Now the hard work has to be done, i.e. choosing the right priors for convergence.

Thanks again @NateAM.

Great! Glad to help!

Hey @NateAM,

Thanks for this great write-up! I know this is a little old now, but would you mind attaching the latest working file so I can work my way through it line by line?

Thanks!

Hi @Fresch,

Which aspect of the problem are you interested in? Fitting the GP or using an already fit GP as a surrogate?

Thanks for the super quick reply @NateAM!

I’m attempting to fit a model as follows

def model_desired(xd, params):
    return np.power(xd[:,0], params[0]) + np.power(xd[:,1], params[1])

where xd is the observed data (n > 1; in the above example n=2 dimensions, but in reality it will be 5-10 dimensions) and params will be n x 1 (again, 2x1 in the above example).

but conceptually I’m unsure how to structure my surrogate model (specifically the kernel for my (desired) model rather than the existing cov kernel).

Any help would be great - thanks again!!

I’ve attached a sample with what I’m trying to achieve based on the examples above
pcmc1.py (2.1 KB)

@Fresch
Unfortunately, I’m not able to share my code anymore (my employer has a restriction on what I can distribute) but I can give you some ideas and pseudo-code which I hope will help.

In my application, I don’t (currently) care about the uncertainty on my GP hyperparameters. I just need the best possible fit. This is because I’m constructing a surrogate model of the likelihood directly rather than trying to build a surrogate of my model response.

I’m adapting a modification of this paper (https://arxiv.org/abs/1703.09930) to do my sampling.

The process is as follows:
First, you construct a function that evaluates the log-likelihood of a given parameter set given the data. You can do this directly using pymc3 by doing:

import theano
import pymc3

#Wrap the expensive model evaluation in a shared variable so it can be swapped out later
y = theano.shared(some_model_evaluation)
sigma = 1.0 #Assuming sigma isn't a parameter though you could make it one

with pymc3.Model() as model:
    likelihood = pymc3.Normal('likelihood', mu=y, sd=sigma, observed=data)

#model.logp is a compiled function that takes a point (empty here, since there are no free RVs)
logp = model.logp

#Now evaluate the logp for the given expensive model evaluation
some_model_evaluation_logp = logp(model.test_point)

#To evaluate another model run do
y.set_value(some_other_model_evaluation)
some_other_model_logp = logp(model.test_point)

From this you can build up two arrays: one of parameter sets, the other a vector of corresponding log-likelihoods.
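
To build those arrays you can loop over your parameter sets, something like the sketch below (expensive_model and parameter_sets are stand-ins for whatever runs your actual model and for the parameter sets you have chosen; y, logp, and model come from the snippet above):

import numpy as np

#parameter_sets: (n_samples, n_params) array of parameter sets you have already chosen
#expensive_model: placeholder for whatever produces a model evaluation for one parameter set
parameter_sets = np.asarray(parameter_sets)
log_likelihoods = np.empty(len(parameter_sets))

for i, p in enumerate(parameter_sets):
    y.set_value(expensive_model(p))               #swap in the new model evaluation
    log_likelihoods[i] = logp(model.test_point)   #evaluate the pymc3 log-likelihood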

You can now fit a Gaussian process to the log-likelihood as a function of the parameters. I actually do this in GPy rather than pymc3 because, like I mentioned, uncertainty on my hyperparameters is not something I’m really worried about. You could also fit it in scikit-learn, or anywhere else you can either write out the GP kernel explicitly or just use the kernels already available in pymc3.
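
With GPy that fit is just a few lines, something like this (using the parameter_sets and log_likelihoods arrays from above; GPy expects the targets as a column vector):

import GPy

#RBF kernel with one lengthscale per parameter
kernel = GPy.kern.RBF(input_dim=parameter_sets.shape[1], ARD=True)
gp_fit = GPy.models.GPRegression(parameter_sets, log_likelihoods[:, None], kernel)
gp_fit.optimize(messages=False)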

Now you have a function that maps parameters to the log likelihood so you can construct a pymc3 model via:

#Solve for alpha based on a sampling of parameters and log-likelihood evaluations (see algorithm 2.1 http://www.gaussianprocess.org/gpml/chapters/RW2.pdf)
alpha = ...

import theano.tensor as tt
with pm.Model() as model:
    #define your priors
    a_ = pm.Normal(...
    b_ = pm.Beta(...
    ...
    #Stack the variables
    params_ = tt.stack([a_,b_, ...],axis=0).reshape([1,nparams])

    #Evaluate the log-likelihood
    Ks = my_cov.K(parameter_samples, params_)
    likelihood = pm.Potential('likelihood', tt.dot(Ks.T, alpha))

    #Perform your sampling
    trace = pm.sample(...)
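
For the alpha = ... placeholder above, a minimal sketch of algorithm 2.1 would be something like the following, where K_train, train_logp, and noise_var stand in for your kernel matrix over parameter_samples, the corresponding (standardized) log-likelihoods, and a small noise/jitter variance:

import numpy as np
from scipy import linalg

#alpha = (K + sigma_n^2 I)^{-1} y, computed via a Cholesky factorization for stability
L = linalg.cholesky(K_train + noise_var * np.eye(K_train.shape[0]), lower=True)
alpha = linalg.cho_solve((L, True), train_logp)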

Most of the pitfalls with this approach (that I’ve found) are centered around the surrogate:

  1. I’ve found you need to have samples all around your optimum to get convergence to good values. Local minima are everywhere.
  2. Fitting the Gaussian process can be non-trivial depending on your log-likelihood. You should consider standardizing your data (subtract the mean and divide by the standard deviation BEFORE you fit your GP; then, after you evaluate the GP, multiply the response by the standard deviation and add back the mean). There is a small sketch of this after the list.
  3. Outliers in the log-likelihood can really kill your GP fit.
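
A minimal sketch of the standardization in point 2, with toy numbers standing in for the raw log-likelihood values and gp_mean standing in for a GP prediction:

import numpy as np

#toy log-likelihood values just to show the standardize / un-standardize round trip
train_logp = np.array([-120.3, -98.7, -143.2, -101.5])
mean_logp, std_logp = train_logp.mean(), train_logp.std()

#fit the GP on these standardized values
standardized_logp = (train_logp - mean_logp) / std_logp

#after the GP predicts (gp_mean is a stand-in for that prediction),
#map the response back to the original scale
gp_mean = standardized_logp[0]
prediction = gp_mean * std_logp + mean_logp
print(prediction)  #recovers -120.3 (up to floating point)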

Let me know if any of that is unclear or if you need any further help.

Good luck!


Thank you for the detailed answer. Could you please tell me why you use y=np.asarray([0.])?
Could I set y as the y* with shape (number of samples, 1)? thx


Hey @NateAM ,
I know this thread is pretty old but I have a problem somewhat similar to what you have experienced. I have created a GP model in GPy and am trying to get parameter distributions in pymc3. I am using a normal likelihood function and use the GPy model for the mean predictions for mu. I had a problem with using theano tensors in GPy functions to get predictions, covariances, etc. Have you encountered the same problem or do you know a solution to the problem?

@trot Sorry it took me a bit to get back to you.

I think your use case is a bit different than mine, but we can try talking through it. So, from what you wrote, you have fit a GP model in GPy and would like to sample using that model in some way in pymc3, either by using it to represent some quantity of interest or to model some complicated likelihood function directly. Is that correct?

If you are trying to use a GPy fit directly in a pymc3 model I think you will run into some difficulties because pymc3 requires the gradients w.r.t. the parameters in order to use NUTS which I don’t believe GPy can give you. If you still want to do that you will need to wrap the function using something like this approach for a black-box likelihood function.

Note: If you use this approach you will have to use sub-optimal samplers!

If you are trying to re-create GPy in pymc3 ( which I have done to some extent ) you can try and implement the basic Gaussian process model directly using Theano types. If that’s the case, here is the textbook I used when I was doing this myself.
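
A bare-bones sketch of what that looks like with a squared-exponential kernel (toy data and fixed hyperparameters, purely to show the Theano pieces) is something like:

import numpy as np
import theano
import theano.tensor as tt
from theano.tensor import slinalg

def sqexp(A, B, eta, ls):
    """Squared-exponential kernel between two sets of points."""
    d2 = tt.sum(A ** 2, 1)[:, None] + tt.sum(B ** 2, 1)[None, :] - 2.0 * tt.dot(A, B.T)
    return eta ** 2 * tt.exp(-0.5 * d2 / ls ** 2)

#toy training data and fixed hyperparameters (placeholders)
X_train = np.random.rand(20, 2)
y_train = np.sin(X_train.sum(axis=1))
eta, ls, noise = 1.0, 0.3, 1e-4

X_new = tt.dmatrix('X_new')
K = sqexp(tt.constant(X_train), tt.constant(X_train), eta, ls) + noise * tt.eye(len(X_train))
Ks = sqexp(X_new, tt.constant(X_train), eta, ls)
alpha = slinalg.solve(K, tt.constant(y_train))
posterior_mean = tt.dot(Ks, alpha)

predict_mean = theano.function([X_new], posterior_mean)
print(predict_mean(np.random.rand(3, 2)))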

As an alternative ( and what I would suggest you consider ):
You could use the python Surrogate Modeling Toolbox which has all kinds of surrogate models and, most crucially, also includes their gradients. You can wrap these functions using the black-box tutorial, and a bit of finagling, and have a collection of surrogate models that work directly in pymc3.
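
As a rough sketch of that suggestion (the SMTMeanOp / SMTGradOp class names are mine, the SMT calls assume its Kriging (KRG) model with predict_values / predict_derivatives, and the Op pattern follows the black-box tutorial; the training data here is just toy data so the sketch runs end to end):

import numpy as np
import theano
import theano.tensor as tt
from smt.surrogate_models import KRG

class SMTGradOp(tt.Op):
    """Gradient of the scalar SMT prediction with respect to the input vector."""
    itypes = [tt.dvector]
    otypes = [tt.dvector]

    def __init__(self, sm):
        self.sm = sm

    def perform(self, node, inputs, outputs):
        (x,) = inputs
        derivs = [self.sm.predict_derivatives(x[None, :], kx)[0, 0] for kx in range(x.size)]
        outputs[0][0] = np.asarray(derivs, dtype=np.float64)

class SMTMeanOp(tt.Op):
    """Wrap an SMT surrogate so its scalar prediction can sit inside a Theano/pymc3 graph."""
    itypes = [tt.dvector]
    otypes = [tt.dscalar]

    def __init__(self, sm):
        self.sm = sm
        self.grad_op = SMTGradOp(sm)

    def perform(self, node, inputs, outputs):
        (x,) = inputs
        outputs[0][0] = np.asarray(self.sm.predict_values(x[None, :])[0, 0])

    def grad(self, inputs, output_grads):
        (x,) = inputs
        return [output_grads[0] * self.grad_op(x)]

#toy training data for the surrogate
X_train = np.random.rand(30, 3)
y_train = np.sin(X_train.sum(axis=1))

sm = KRG()
sm.set_training_values(X_train, y_train[:, None])
sm.train()

surrogate = SMTMeanOp(sm)
x = tt.dvector('x')
predict = theano.function([x], surrogate(x))
print(predict(np.array([0.1, 0.2, 0.3])))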

Hopefully that is helpful!


@NateAM Sorry for the late response, I went on break just as you responded.

I looked over and tried the different options you suggested and decided that I am just going to recreate the GP model in PyMC3. Just as a refresher, I am trying to do level 1 inference on the (13) parameters of my data and I do not really care about the distributions of my hyperparameters. I have a couple of questions and just want to see if I am heading in the right direction with my code.

In your past tutorial comments, you have used a custom likelihood function for Bayesian inference. The likelihood function I am using (shown in the attached image) leads me to the MvNormal (Normal for testing) likelihood function. I have already built my GP model in pymc3 but am having problems with the Bayesian inference side. I have the following model:

with pm.Model() as bamodel:
    var1 = pm.Uniform('rhoe', lower=16.0, upper=18.0, testval=16.94)
    var2 = pm.Uniform('alpha', lower=15.8, upper=19.2, testval=17.5)
    ...
    var13 = pm.Uniform('var13', lower=6.0, upper= 10.2, testval=8.16)
    theta = tt.stack([var1, var2, \
            ..., var13], axis=0).reshape([1, 13])
    mean, cov = gp.predict(theta, point=mp, diag=True) # gp.predictt(theta, diag=True)
    sigNoise = pm.HalfNormal('sigNoise', sigma=0.5, testval=0.05)
    cov_noise = pm.Deterministic("cov_noise", (cov + sigNoise**2)) 
    likelihood = pm.Normal("likelihood", mu=mean, tau=cov_noise, observed=yobs)
    trace = pm.sample(1000, chains=2, cores=1, init='adapt_diag', start=start_vals)

I created a separate model for the level 1 inference and would call the GP model using the predictt function (which outputs the symbolic mean prediction and covariance), but I kept getting a Theano error: “Input 0 of the graph (indices start from 0), used to compute Elemwise…” I have replaced it with gp.predict, but this seems to just draw from the prior since the predict function outputs a deterministic value. Is this code headed in the right direction?

Thank you again for your last comment, every amount of information was extremely helpful.

When I was using the gaussian processes from pymc3 I did something like this:

where for you _params would be theta and the values of surrogate_params were a dictionary of the parameters from my original fitting process of the surrogate. Unless the pymc3 interface for GP has changed this should work.

As I’m thinking about it, one more thing I would strongly suggest you consider is to map your input parameters to the hypercube through a Copula ( see while my_mcmc:  gently(samples) - An intuitive, visual guide to copulas ). In this way your variables will all vary from 0 to 1. You would do something like:

import numpy as np
import pymc3 as pm
from scipy import stats

#The true distributions of my parameters
dist1 = stats.norm( loc = 8.7, scale = 2e1 )
dist2 = stats.something_crazy( crazy parameters! )
...
distributions = [ dist1, dist2, ... ]

#Evaluate my function on samples from my parameters
#X is a matrix ( n samples, n parameters ); outputs is a matrix ( n samples, n outputs )
outputs = np.array( [ f( xi ) for xi in X ] )

#Map the parameters to the hypercube using the CDF of each distribution
mapped_X = np.vstack( [ dist.cdf( X[ :, i ] ) for i, dist in enumerate( distributions ) ] ).T

#Normalize your outputs by subtracting the mean and ( maybe ) dividing by the standard deviation
mean_y = outputs.mean( 0 ) #Definitely do this
std_y = outputs.std( 0 ) #Maybe do this
norm_outputs = ( outputs - mean_y ) / std_y

#Don't forget to do this too
norm_data = ( data - mean_y ) / std_y

#Fit your GPs ( fit_gp returns a dict of fitted hyperparameters for each output )
surrogate_params = [ fit_gp( mapped_X, norm_outputs[ :, i ] ) for i in range( outputs.shape[ 1 ] ) ]

#Then in pymc3
with pm.Model( ) as model:
    #Shape ( 1, n parameters ) so it can be passed straight to _build_conditional
    _x = pm.Uniform( "mapped_x", lower = 0, upper = 1, shape = ( 1, X.shape[ 1 ] ) )

vars = [ { } for _ in range( norm_outputs.shape[ 1 ] ) ]

for i, sp in enumerate( surrogate_params ):

    with model:
        vars[ i ].update( { "cov_func": sp[ "eta" ].reshape( [ 1, 1 ] )**2. * sp[ "covfxn" ]( X.shape[ 1 ], ls = sp[ "ls" ] ) } )
        vars[ i ].update( { "gp": pm.gp.Latent( cov_func = vars[ i ][ "cov_func" ] ) } )
        vars[ i ].update( { "givens": vars[ i ][ "gp" ]._get_given_vals( { 'X': mapped_X, 'f': norm_outputs[ :, i ], 'y': norm_outputs[ :, i ], 'noise': sp[ 'noise' ] } ) } )
        vars[ i ].update( { "surrogate": pm.Deterministic( 'surrogate_{0}'.format( i ), vars[ i ][ "gp" ]._build_conditional( _x, *vars[ i ][ "givens" ] )[ 0 ] ) } )
        vars[ i ].update( { "eps": pm.HalfNormal( 'eps_{0}'.format( i ), sd = 1 ) } ) #Model form uncertainty
        vars[ i ].update( { "likelihood": pm.Normal( "likelihood_{0}".format( i ), mu = vars[ i ][ "surrogate" ], sd = vars[ i ][ "eps" ], observed = norm_data[ :, i ] ) } )

with model:
    trace = pm.sample( 1000 )

That's at least how I do things now, except I use the surrogate modeling toolbox ( SMT: Surrogate Modeling Toolbox — SMT 2.2.1 documentation ) with a theano wrapper.


I just want to start by saying thank you for your help, it has been a game-changer for me. However, I have encountered a slew of problems that I would imagine you have run into before, and I want to know how you went about handling these types of errors. I currently still have GPs being built in PyMC3 which are giving me pretty good results (R2 > 0.99), but once I start the Bayesian inference I keep encountering errors like “Mass matrix contains zeros on the diagonal”. When I look up what usually causes this error, the answer is usually a bad model, but the GP models seem to predict very well. Also, the time it takes the trace to run increases a lot as more samples are taken (e.g. 1 s/iter at under 500 draws, but increasing to 10 s/iter as the trace reaches 1000 draws and eventually failing). I am currently using NUTS and have worked with the default init but also init='adapt_diag'. I was wondering how you moved past these errors.

On another data set that I was working on, I get an error in the linalg side computing the Cholesky decomp while establishing the model in _build_conditional with an error that says “n-th leading minor of the array is not positive definite”. I am not really sure how to handle this since the data was created in a similar way with a different optimized parameter set.

I am not sure if you have seen these errors before, but feel free to give me any insights you have about getting better convergence or just dealing with these errors directly. Thank you for everything you have done so far.

I’m glad it has been helpful!

I have definitely encountered some of those problems. I don’t know that I’ll have much in the way of solutions, but I can at least let you know some of the things that I tried when I got stuck in a similar way.

For the first problem:

I get mass matrix problems when I have parameters that are very different in scale, i.e. order 1 and order 1e-3 or something like that. Normalizing your parameters or re-parameterizing the problem can help here. I can also get them when some of my independent parameters aren't really important ( i.e. my problem is degenerate ). It can be hard to figure out which of your parameters is the bad actor, but something you can do, which I've had some success with, is to pre-process the data to make it easier to work with.

By that I mean, I'll do PCA on my outputs to find the directions of maximum variance and their singular values, and then do CCA on my inputs to determine which variables matter for my outputs. Scikit-learn has both of these tools available. Let's say you have N collections of P parameters and a corresponding matrix of N collections of O outputs. If you perform PCA on the N x O output matrix you will get the singular values of the output directions. It may be that you can reduce the number of outputs from O to o. Now, using your incoming parameter matrix ( N x P in size ) you can use CCA with your reduced output matrix ( N x o ) to determine which incoming variables are important to the outputs. It may be that you can reduce the input parameter space from P to p.

Now you can form your GP mapping from the matrix ( N x p ) to ( N x o ) where the parameters and outputs should be better conditioned.
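
As a rough sketch of that pre-processing with scikit-learn (random placeholder data; the cutoff o and the interpretation of the loadings are problem-dependent):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

#placeholder data: N parameter sets with P inputs and O outputs each
N, P, O = 200, 13, 8
params = np.random.rand(N, P)
outputs = np.random.rand(N, O)

#PCA on the outputs: the explained variance suggests how many directions (o < O) to keep
pca = PCA().fit(outputs)
print(pca.explained_variance_ratio_)
o = 3                                          #example cutoff chosen from the printout
reduced_outputs = pca.transform(outputs)[:, :o]

#CCA between the inputs and the reduced outputs: large loadings flag the inputs that matter
cca = CCA(n_components=o).fit(params, reduced_outputs)
print(cca.x_loadings_)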

For the second problem:

One reason the covariance matrix becomes ill-conditioned is if the distance between your data points is much less than your lengthscale term. There are lots of different ways to handle these duplicate ( in the sense of what the GP can see ) points but I would try averaging them or just throwing away the offending point(s) to see if that helps. There are more statistically rigorous ways to do this but this should help diagnose the problem.
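
A quick way to flag those near-duplicate points, with X_train, y_train, and lengthscale standing in for your training data and the fitted lengthscale, is something like:

import numpy as np
from scipy.spatial.distance import pdist, squareform

#toy data with an injected near-duplicate point
X_train = np.random.rand(50, 2)
X_train[10] = X_train[3] + 1e-9
y_train = np.sin(X_train.sum(axis=1))
lengthscale = 0.3

#pairwise distances between training points; anything far below the lengthscale is suspect
dists = squareform(pdist(X_train))
np.fill_diagonal(dists, np.inf)
too_close = np.argwhere(dists < 1e-3 * lengthscale)

#drop one point from each offending pair (averaging them is the other quick option)
keep = np.ones(len(X_train), dtype=bool)
for i, j in too_close:
    if keep[i] and keep[j]:
        keep[j] = False

X_clean, y_clean = X_train[keep], y_train[keep]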
