Gaussian Process regression with Automatic Relevance Determination


#1

Currently, I’m considering to use GP regression for a my project. I typically use following script is to train a GP regression model with PyMC3.

X // features
y // target
with pm.Model() as model:

        ℓ = pm.Gamma("ℓ", alpha=2, beta=1)
        η = pm.HalfCauchy("η", beta=5)

        cov = η ** 2 * pm.gp.cov.Matern52(X.shape[1], ℓ)
        gp = pm.gp.Marginal(cov_func=cov)

        σ = pm.HalfCauchy("σ", beta=1)
        y = gp.marginal_likelihood("y", X=X, y=y, noise=σ)

        map_trace = [pm.find_MAP()]

I’m interested in understanding, how to perform automatic relevance determination for GP regression with PyMC3. With basic regression techniques, we could have used some hyper-priors for the coefficients of features, instead of deterministic values. However, with GP regression I don’t see a way of incorporating such hyper-priors to the kernel function.

Do you have any suggestions? I’m appreciate any help on performing automatic features selection within the models to minimize the effect of the unimportant features.

Thanks!


#2

I think you can modify the prior for ℓ and η for that purpose.


#3

I was initially thinking the same approach, but I cannot figure out how that could determine the relevance of the features. Aren’t we suppose to use single dimensional RVs for ℓ and η, assigning a single RV for all the features?


#4

In general you can only easily regularise hyper parameters (priors), and for GP especially that’s what you should do. If you want specific regularization on some range of the output (eg, smaller smoothing in a specific range of Y), then you need to design your cov function.

I guess alternatively, you can also try adding a potential that takes the cov function as input.


#5

Thanks for the suggestion. I’m still not convince this is what I’m looking for.

I don’t want to regularize a specific range. As what I have understood, we usually use ARD to determine the importance of each feature. So in a simple case such as linear regression, we use a hyper-prior assigned to the std of each coefficient to control the effect of each feature towards the predictions.

However, ℓ and η are same for all the features according to usual GP implementation. We do not assign a single ℓ for each features, instead we divide the euclidean distance between two data samples (these operations may vary according to the kernel type) using a single dimensional RV ℓ. So how could that control the effect on individual feature differently (according to the relevance of the each feature)?


#6
with pm.Model() as model:
    ℓ = pm.Gamma("ℓ", alpha=2, beta=1, shape = X.shape[1]]) // I changed this line
    η = pm.HalfCauchy("η", beta=5)

    cov = η ** 2 * pm.gp.cov.Matern52(X.shape[1], ℓ)
    gp = pm.gp.Marginal(cov_func=cov)

    σ = pm.HalfCauchy("σ", beta=1)
    y = gp.marginal_likelihood("y", X=X, y=y, noise=σ)

    map_trace = [pm.find_MAP()]

What do you think will happen if I implement GP as shown above? will this define a ℓ _i for each feature since I have defined ℓ with X.shape[1] number of dimensions. (I changed the second line)


#7

Oh, I did not understand fully what you meant before. So you have multi dimension in X (eg, 2d GP which the mean function would be a surface), then yes you can assign different ℓ and η to regularized the feature in each dimension.


#8

So can I achieve that by simply setting the shape of ℓ and η to X.shape[1] (number of features) ?


#9

I think so.


#10

Do we have someone who is familiarize with the PyMC3 GP implementation? I wanted to verify that setting ℓ and η with a shape of X.shape[1] (number of dimensions) is gonna consider each ℓ_i for i^{th} feature during the computations?


#11

@bwengals


#12

yessir it will. Although for ARD, eta is a scalar.


#13

Thanks @bwengals, I did not change the size of eta, but instead of a scalar I used a RV as shown in the scripts. I assumed that is better than using a scaler according to Bayesian.