# Simple GP model giving large MAP values for covariance timescale

I’m trying to fit a simple GP model with exponential covariance (damped random walk model) with timescale l. However, the MAP values do not seem reasonable, and fail to recover the input values from simulated time series.

The code for the model is:

def fit_drw(t, y, yerr, cadence, baseline, amplitude, precision):

import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt

with pm.Model() as model:

# damped random walk
l = pm.Uniform("l", lower=np.sqrt(1/2), upper=np.sqrt(1e8*baseline/2))
# 2l^2 = tau
# l = sqrt(tau/2)
sigma_drw = pm.Uniform("sigma_drw", lower=0.1*precision, upper=10*amplitude)
cov = 2*sigma_drw**2 * pm.gp.cov.Exponential(1, l)
gp_drw = pm.gp.Marginal(cov_func=cov)

# The Gaussian process is a sum of these three components
gp = gp_drw

# Since the normal noise model and the GP are conjugates, we use Marginal with the .marginal_likelihood method
X = t[:, None]
y_ = gp.marginal_likelihood("y", X=X, y=y, noise=yerr)
mp = pm.find_MAP() #start={'l': 100, 'sigma_drw': amplitude})

# Predict
tpred = np.linspace(np.min(t), np.max(t)+400, 1000)
Xpred = tpred[:, None]

mu, var = gp.predict(Xpred, point=mp, diag=True)
sd = np.sqrt(var)

plt.figure(figsize=(8,4))
plt.plot(tpred, mu, "dodgerblue", lw=3)
plt.fill_between(tpred, mu-sd, mu+sd, color="dodgerblue", alpha=0.2)
plt.errorbar(t, y, yerr=yerr, color="k", linestyle='none', ms=3, alpha=1)
plt.show()

print(mp)

return


The MAP prediction looks reasonable when plotted below. However, the MAP value of l is huge! Why?

{'l_interval__': array(12.60262633), 'sigma_drw_interval__': array(1.12011405), 'l': array(499998.31842261), 'sigma_drw': array(8.61098201)}


For completion, the coed to simulate the time series is:

from astroML.time_series import generate_damped_RW

dt = 50
baseline = 5000
t = np.arange(0, baseline, dt)
y = generate_damped_RW(t, tau=250, xmean=20, SFinf=0.3, z=0.0)

fit_drw(t, y, 0.005*y, cadence=dt, baseline=baseline, amplitude=np.max(y)-np.min(y), precision=0.005)


The details of the generate_damped_RW function should not be a concern, other than the input timescale is given by 2l^2 = tau.

Thanks!

Note I had to make the upper limit on the l prior very large to get a reasonable prediction (otherwise the MAP value of l was just giving the upper-limit).

I messed around with your code and found the same thing. I have a feeling that you’re experiencing the same issue that Michael Betancourt talks about in this blog post. See this thread for some advice. Long story short, priors are important (but especially with GPs)!

Thanks @jlindbloom,

I tried the InverseGamma prior (using the minimum sampling rate and time series length as the lower and upper length scales) and Normal prior for the amplitude term. Now we are getting somewhere, but the recovered l (or tau) seem systematically larger (by a value of ~1e5), and much larger than the time series length! Do you have any ideas what could cause this difference by a multiplicative factor?

Below I show the input timescale vs. recovered MAP timescale after dividing by 1e5:

What precisely do you mean by “input” \tau? Do you mean the true timescale used to generate your training data?

Can you share your code for this?

Yes, input \tau is the the timescale of the simulated time series. Code below:

def fit_drw(t, y, yerr, cadence, baseline, amplitude, precision):

import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt

l_upper = baseline
l_sigma = (l_upper-l_lower)/6
l_mu = l_lower + 3*l_sigma

with pm.Model() as model:

# damped random walk
tau = pm.InverseGamma("tau_drw", mu=l_mu, sigma=l_sigma)
# 2l^2 = tau_drw
# l = sqrt(tau/2)
sigma_drw = pm.Normal("sigma_drw", mu=amplitude, sigma=precision)
cov = 2*sigma_drw**2 * pm.gp.cov.Exponential(1, np.sqrt(tau/2))
gp_drw = pm.gp.Marginal(cov_func=cov)

# white noise
sigma_n = pm.Normal("sigma_n", mu=precision, sigma=precision)
cov = pm.gp.cov.WhiteNoise(sigma_n)
gp_wn = pm.gp.Marginal(cov_func=cov)

# The Gaussian process is a sum of these two components
gp = gp_drw + gp_wn

# Since the normal noise model and the GP are conjugates, we use Marginal with the .marginal_likelihood method
X = t[:, None]
y_ = gp.marginal_likelihood("y", X=X, y=y, noise=0)
mp = pm.find_MAP(start={'tau_drw': 0.25*baseline, 'sigma_drw': amplitude, 'sigma_n':precision})

# Predict
tpred = np.linspace(np.min(t), np.max(t), 1000)
Xpred = tpred[:, None]

mu, var = gp.predict(Xpred, point=mp, diag=True)
sd = np.sqrt(var)

plt.figure(figsize=(12,4))
plt.plot(tpred, mu, "dodgerblue", lw=3)
plt.fill_between(tpred, mu-sd, mu+sd, color="dodgerblue", alpha=0.2)
plt.errorbar(t, y, yerr=yerr, color="k", linestyle='none', ms=3, alpha=1)
plt.show()

return mp

from astroML.time_series import generate_damped_RW

dt = 50
baseline = 5000
t = np.arange(0, baseline, dt)

# Simulate time series at varying tau
tau_outs = []
tau_ins = np.linspace(50, 500, 20)
for tau_in in tau_ins:

y = generate_damped_RW(t, tau=tau_in, xmean=20, SFinf=0.3, z=0.0)

noise_level = 0.005
y += np.random.normal(noise_level)

mp = fit_drw(t, y, noise_level*y, cadence=dt, baseline=baseline, amplitude=np.max(y)-np.min(y), precision=noise_level)
tau_outs.append(mp['tau_drw'])

# Plot input versus MAP tau
plt.scatter(tau_ins, np.array(tau_outs)/1e5)
plt.plot([0, 500], [0, 500], color='k', lw=2)
plt.xlim([0, 500])
plt.ylim([0, 500])
plt.ylabel(r'MAP $\tau / 10^5$', fontsize=16)
plt.xlabel(r'Input $\tau$', fontsize=16)


Update: The multiplicative “correction” factor seems to be dependent on the time series baseline (length). ~20 times the baseline seems to work. Strange. I am still investigating.

With GPs, I would strongly recommend standardizing your input and output variables (subtract their mean and divide by their standard deviation) prior to fitting. I find this makes it easier to specify priors (largely because you can just use off-the-shelf suggestions rather than rescaling things) and to interpret parameter posteriors. I think there’s also something computationally advantageous about picking priors that are closer to “standard” values (e.g., N(0,1) rather than N(1e5,1e4)), something about the tuning, but I could be completely off-base with that. But I do think part of your issue may arise from the fact that your input and output variables lie on vastly different scales, increments of ~100 vs 0.01. That could be why your MAP value for sigma_drw actually also seems 1-2 orders of magnitude larger than I might expect. The interaction between GP variance and lengthscale is complex and somewhat unintuitive; technically speaking, I’m pretty sure variance and lengthscale are unidentifiable (though their ratio is identifiable), meaning there are ~infinite combinations that give identical behavior, so I wouldn’t worry about trying to reproduce their values or draw too much interpretation from their values.

Also why take the square root of tau for the lengthscale? That’s just going to force tau to be larger than necessary, and sort of defeats the purpose of Betancourt’s principled InverseGamma prior.

I’m also suspicious that what you’re seeing is partly to do with the way you are introducing noise to your simulations and specifying noise in your model. It’s not common to account for noise in the GP with a WhiteNoise kernel; typically you would specify a prior for the noise argument to gp.marginal_likelihood, e.g. σ = pm.Exponential('σ', lam=1) ... gp.marginal_likelihood('ml', X=X, y=y, noise=σ). Those may be equivalent formulations, but I’m not sure. I noticed that in your code earlier in the thread, you specified yerr for this argument. This essentially acts as weights to the mean of the GP, and probably isn’t what you want. Additionally, you added noise_level to y in order to jitter those values, but you multiplied it by y to specify yerr, which doesn’t make sense to me.

You’ll want to specify a strictly positive prior for the GP variance (your sigma_drw, more commonly called eta , η, for a GP) as well as noise; for standardized data, Gamma(2,1) works well for variance, Exponential(1) for noise. Because you’re squaring it, a Normal prior will end up having a bimodal posterior around 0, which will just make it unnecessarily difficult for your sampler to converge.

tl;dr: try scaling/standardizing your inputs and outputs, which should allow you to use more standard priors, and specify noise as noise, not a separate kernel.

1 Like

Standardizing the inputs seems like a good idea.

The motivation for taking the square root of the lengthscale is so the inferred lengthscale is the same units as the input times. However, there is nothing wrong with doing the conversion between tau and l afterwords and adopting the standard priors. It also makes the analytic PSD easy to write down. See link below if you are curious about this convention in my field:

The lengthscale is not completely unidentifiable/degenerate with the variance, although the covariance between the two parameters is strong. As my figure above shows, the MAP lengthscale is proportional to the simulated input lengthscale.

Thanks for noticing the noise typo and suggestions on the priors. I’ll keep looking into this.

1 Like

Going to mark this as solved. Indeed something was wrong with my priors initially. MAP timescale still seems off by a factor of a few, but that could be a bias in the MAP estimate perhaps. Also once the timescale gets too long compared to its length (input tau ~ 10 length; gray lines), the time series is not stationary and the MAP timescales saturate, which is a well-known effect. I’m not sure if there is good solution for this.

1 Like