GP and Bayesian linear regression in PyMC


I have the problem of how to combine a bayesian linear regression and gaussian process. As an example, I have two features, a 2D spatial GP trying to fit a model as:

y \sim \alpha+\beta_0 x_0 + \beta_1 x_1 + GP_{spatial}

Created the following code in PyMC but still not sure about the implementation:

X = data[['x1', 'x2']].T.values
X_gp = data[['gp0', 'gp1']].values
y = data.y.values

with pm3.Model() as model_mlr:
    ## Bayesian linear regression
    Ξ±_blr = pm3.Normal('Ξ±_blr', mu=0, sd=10)
    Ξ²_blr = pm3.Normal('Ξ²_blr', mu=0, sd=1, shape=num_features)
    Οƒ = pm3.HalfCauchy('Οƒ', 5)
    ΞΌ_blr = Ξ±_blr +Ξ²_blr, X)

    ## The spatial GP
    Ξ·_spatial_trend = pm3.Uniform('Ξ·_spatial_trend', lower=0, upper=100)
    β„“_spatial_trend = pm3.Uniform('β„“_spatial_trend', lower=0, upper=100, shape=gp_input_dim)
    cov_spatial_trend = (
        Ξ·_spatial_trend**2 *, ls=β„“_spatial_trend)
    gp_spatial_trend =
    gp_spatial_func = gp_spatial_trend.prior('gp_spatial_func', X=X_gp)
    # noise model
    cov_noise =Οƒ)

    y_hat = pm3.Normal(
        mu=ΞΌ_blr + gp_spatial_func, 

    start = None
    step = pm3.Metropolis()
    trace_mlr = pm3.sample(30000, start=start, step=step, burn=10000)
    chain = trace_mlr[-1000:]

  • Should I use gp.Latent or gp.Marginal?

  • How prediction works in this case?
    when I use gp.Latent, I get complain about not converging.

Here is a sample data

sample_data.csv (37.8 KB)


Yes you should be able to use Marginal here, since your observed variable has a normal distribution. Put your BLR code into the Linear mean function. Your model is equivalent to

y \sim \mathcal{GP}(\alpha + \beta_0 x_0 + \beta_1 x_1\,, K(x_{sp}\,, x_{sp}')) + \epsilon

where \alpha + \beta_0 x_0 + \beta_1 x_1 is the mean function. Is your data on an equispaced two dimensional grid? If so, check out the MarginalKron implementation.


Thanks. I tried:

gp\_spatial\_trend =\ΞΌ\_blr), cov\_func=cov\_spatial\_trend)

and then

y\_hat = gp\_spatial\_trend.marginal\_likelihood("y\_hat", X=X\_gp, y=y, noise=cov\_noise)

but getting ValueError: shapes (400,2) and (400,) not aligned: 2 (dim 1) != 400 (dim 0).
BTW, if we change from Normal distribution to Poisson, how to rewrite the original code?


If you use a Poisson distribution, you will need to use Latent. The naming of marginal_likelihood refers to the fact that your observed y variable is distributed as

p(y \mid X) = \int p(y \mid f, X) p(f \mid X) df

where f is the GP being marginalized out. This integral is possible if p(y \mid f, X) is normal, since it is conjugate with p(f \mid X), which is the distribution over the Gaussian process prior. This is why Latent uses the naming β€˜prior’.

Check the documentation of Linear, it’s inputs are the coefficients, and the intercept. So your call
should look more like The X you use will need to have four columns, [x_0, x_1, x_{gp1}, x_{gp2}]. The following is pseudocode:

X = [x_0, x_1, x_{gp1}, x_{gp2}]
beta = [beta_0, beta_1, 0, 0]  # the last two columns are full of zeros
mean_func =
# note `active_dims`.  the cov function only uses the last two columns of `X`.
# `input_dim=4` because X has four columns
cov_func =, ls=ls, active_dims=[2,3]) 

# for normally distributed y
gp =, cov_func=cov_func)
y_ = gp.marginal_likelihood("y", X=X, y=y, noise=sigma)

# for poisson distributed y
gp =, cov_func=cov_func)
f = gp.prior("f", X=X)
y_ = pm.Poisson("y", mu=tt.exp(f), observed=y)

Also, I noticed you were using the Metropolis sampler. Using NUTS is highly recommended for something like this, which is what the default call of sample uses.




It was a life saving. Thanks again.
When I use Latent + Posision with NUTS (5000 samples) I receive this warning:

logp = -1,074.5, ||grad|| = 1.295: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 194/194 [00:07<00:00, 24.94it/s] Multiprocess 
sampling (2 chains in 2 jobs) NUTS: [f_rotated_, β„“_spatial_trend, Ξ·_spatial_trend, Οƒ, Ξ²_blr1, Ξ²_blr0, Ξ±_blr] 
Sampling 2 chains: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5000/5000 [1:39:06<00:00, 1.19draws/s] The gelman-rubin 
statistic is larger than 1.4 for some parameters. The sampler did not converge. The estimated number of effective 
samples is smaller than 200 for some parameters.

I guess since the number of samples is high the kernel function is too big and hence sampling is hard and too time consuming (or it may need more samples). Am I right?


What was your call to sample here?


Missed this part:

start = pm3.find_MAP()
step = pm3.NUTS(scaling=start)
trace_mlr = pm3.sample(2000, start=start, step=step, njobs=2)
chain_latent = trace_mlr[-1000:]


You should not use find_MAP to get initial starting point and scaling for NUTS, that usually is a very bad idea.
You should always start with the default:

with model:
    trace = pm.sample(2000, tune=1000, cores=2)

Also, burn-in and thinning is usually not needed using NUTS - if the trace looks like it needs burn-in, you should increase the tuning, if the trace looks like it needs thinning, you should further reparameterized.