Simple GP model giving large MAP values for covariance timescale

I’m trying to fit a simple GP model with an exponential covariance (a damped random walk model) parameterized by a timescale l. However, the MAP values do not seem reasonable and fail to recover the input values from simulated time series.

The code for the model is:

def fit_drw(t, y, yerr, cadence, baseline, amplitude, precision):
        
        import pymc3 as pm
        import numpy as np
        import matplotlib.pyplot as plt
        
        with pm.Model() as model:
            
            # damped random walk
            l = pm.Uniform("l", lower=np.sqrt(1/2), upper=np.sqrt(1e8*baseline/2))
            # 2l^2 = tau
            # l = sqrt(tau/2)
            sigma_drw = pm.Uniform("sigma_drw", lower=0.1*precision, upper=10*amplitude)
            cov = 2*sigma_drw**2 * pm.gp.cov.Exponential(1, l)
            gp_drw = pm.gp.Marginal(cov_func=cov)

            # Here the Gaussian process is just the single DRW component
            gp = gp_drw

            # Since the normal noise model and the GP are conjugates, we use `Marginal` with the `.marginal_likelihood` method
            X = t[:, None]
            y_ = gp.marginal_likelihood("y", X=X, y=y, noise=yerr)
            mp = pm.find_MAP() #start={'l': 100, 'sigma_drw': amplitude})
            
            # Predict
            tpred = np.linspace(np.min(t), np.max(t)+400, 1000)
            Xpred = tpred[:, None]
            
            mu, var = gp.predict(Xpred, point=mp, diag=True)
            sd = np.sqrt(var)
            
            plt.figure(figsize=(8,4))
            plt.plot(tpred, mu, "dodgerblue", lw=3)
            plt.fill_between(tpred, mu-sd, mu+sd, color="dodgerblue", alpha=0.2)
            plt.errorbar(t, y, yerr=yerr, color="k", linestyle='none', ms=3, alpha=1)
            plt.show()
                        
            print(mp)
            
        return

The MAP prediction looks reasonable when plotted below. However, the MAP value of l is huge! Why?

{'l_interval__': array(12.60262633), 'sigma_drw_interval__': array(1.12011405), 'l': array(499998.31842261), 'sigma_drw': array(8.61098201)}

For completeness, the code to simulate the time series is:

from astroML.time_series import generate_damped_RW

dt = 50
baseline = 5000
t = np.arange(0, baseline, dt)
y = generate_damped_RW(t, tau=250, xmean=20, SFinf=0.3, z=0.0)

fit_drw(t, y, 0.005*y, cadence=dt, baseline=baseline, amplitude=np.max(y)-np.min(y), precision=0.005)

The details of the generate_damped_RW function should not be a concern, other than that its input timescale is related to the lengthscale by 2l^2 = tau.

Thanks!

Note that I had to make the upper limit of the l prior very large to get a reasonable prediction (otherwise the MAP value of l just sat at the upper limit).

I messed around with your code and found the same thing. I have a feeling that you’re experiencing the same issue that Michael Betancourt talks about in this blog post. See this thread for some advice. Long story short, priors are important (but especially with GPs)!
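
For reference, here is one way to set up that kind of lengthscale prior (a sketch along the lines of Betancourt's recipe, not code from this thread): choose the InverseGamma shape and scale so that only a small fraction of the prior mass sits below the minimum sampling interval or above the baseline. The invgamma_params helper and the 1% tail probability are illustrative choices.

import numpy as np
from scipy import stats, optimize

def invgamma_params(l_min, l_max, tail=0.01):
    """Find (alpha, beta) such that P(l < l_min) = P(l > l_max) = tail."""
    def loss(p):
        alpha, beta = np.exp(p)  # optimize in log space to keep both positive
        lo = stats.invgamma.cdf(l_min, alpha, scale=beta)
        hi = stats.invgamma.sf(l_max, alpha, scale=beta)
        return (lo - tail)**2 + (hi - tail)**2
    res = optimize.minimize(loss, x0=np.log([2.0, 3.0*l_min]), method="Nelder-Mead")
    return np.exp(res.x)

# e.g. with the simulation below: cadence = 50, baseline = 5000
alpha, beta = invgamma_params(50, 5000)
# l = pm.InverseGamma("l", alpha=alpha, beta=beta)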

Thanks @jlindbloom,

I tried the InverseGamma prior (using the minimum sampling rate and the time series length as the lower and upper length scales) and a Normal prior for the amplitude term. Now we are getting somewhere, but the recovered l (or tau) seems systematically larger (by a factor of ~1e5), and much larger than the time series length! Do you have any idea what could cause this difference by a multiplicative factor?

Below I show the input timescale vs. recovered MAP timescale after dividing by 1e5:

What precisely do you mean by “input” \tau? Do you mean the true timescale used to generate your training data?

Can you share your code for this?

Yes, input \tau is the timescale of the simulated time series. Code below:

def fit_drw(t, y, yerr, cadence, baseline, amplitude, precision):
        
        import pymc3 as pm
        import numpy as np
        import matplotlib.pyplot as plt
                
        l_lower = cadence
        l_upper = baseline
        l_sigma = (l_upper-l_lower)/6
        l_mu = l_lower + 3*l_sigma
        
        with pm.Model() as model:
            
            # damped random walk
            tau = pm.InverseGamma("tau_drw", mu=l_mu, sigma=l_sigma)
            # 2l^2 = tau_drw
            # l = sqrt(tau/2)
            sigma_drw = pm.Normal("sigma_drw", mu=amplitude, sigma=precision)
            cov = 2*sigma_drw**2 * pm.gp.cov.Exponential(1, np.sqrt(tau/2))
            gp_drw = pm.gp.Marginal(cov_func=cov)

            # white noise
            sigma_n = pm.Normal("sigma_n", mu=precision, sigma=precision)
            cov = pm.gp.cov.WhiteNoise(sigma_n)
            gp_wn = pm.gp.Marginal(cov_func=cov)

            # The Gaussian process is a sum of these two components
            gp = gp_drw + gp_wn

            # Since the normal noise model and the GP are conjugates, we use `Marginal` with the `.marginal_likelihood` method
            X = t[:, None]
            y_ = gp.marginal_likelihood("y", X=X, y=y, noise=0)
            mp = pm.find_MAP(start={'tau_drw': 0.25*baseline, 'sigma_drw': amplitude, 'sigma_n':precision})
            
            # Predict
            tpred = np.linspace(np.min(t), np.max(t), 1000)
            Xpred = tpred[:, None]
            
            mu, var = gp.predict(Xpred, point=mp, diag=True)
            sd = np.sqrt(var)
            
            plt.figure(figsize=(12,4))
            plt.plot(tpred, mu, "dodgerblue", lw=3)
            plt.fill_between(tpred, mu-sd, mu+sd, color="dodgerblue", alpha=0.2)
            plt.errorbar(t, y, yerr=yerr, color="k", linestyle='none', ms=3, alpha=1)
            plt.show()
        
        return mp

from astroML.time_series import generate_damped_RW

dt = 50
baseline = 5000
t = np.arange(0, baseline, dt)

# Simulate time series at varying tau
tau_outs = []
tau_ins = np.linspace(50, 500, 20)
for tau_in in tau_ins:

    y = generate_damped_RW(t, tau=tau_in, xmean=20, SFinf=0.3, z=0.0)

    noise_level = 0.005
    y += np.random.normal(noise_level)

    mp = fit_drw(t, y, noise_level*y, cadence=dt, baseline=baseline, amplitude=np.max(y)-np.min(y), precision=noise_level)
    tau_outs.append(mp['tau_drw'])

# Plot input versus MAP tau
plt.scatter(tau_ins, np.array(tau_outs)/1e5)
plt.plot([0, 500], [0, 500], color='k', lw=2)
plt.xlim([0, 500])
plt.ylim([0, 500])
plt.ylabel(r'MAP $\tau / 10^5$', fontsize=16)
plt.xlabel(r'Input $\tau$', fontsize=16)

Update: The multiplicative “correction” factor seems to depend on the time series baseline (length); roughly 20 times the baseline seems to work. Strange. I am still investigating.

With GPs, I would strongly recommend standardizing your input and output variables (subtract their mean and divide by their standard deviation) prior to fitting. I find this makes it easier to specify priors (largely because you can use off-the-shelf suggestions rather than rescaling everything) and to interpret the parameter posteriors. I think there is also a computational advantage to priors that sit closer to “standard” values (e.g., N(0,1) rather than N(1e5,1e4)), something to do with the tuning, but I could be completely off-base on that.

I do think part of your issue may arise from the fact that your input and output variables lie on vastly different scales (increments of ~100 vs. ~0.01). That could also be why your MAP value for sigma_drw seems 1–2 orders of magnitude larger than I would expect. The interaction between GP variance and lengthscale is complex and somewhat unintuitive; technically speaking, I’m pretty sure variance and lengthscale are unidentifiable (though their ratio is identifiable), meaning there are effectively infinite combinations that give identical behavior, so I wouldn’t worry about reproducing their values exactly or reading too much into them.
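
To make the standardization suggestion concrete, a minimal sketch (not from the thread; t, y, yerr are the arrays passed to fit_drw above, and the fit itself is elided):

t_mu, t_sd = t.mean(), t.std()
y_mu, y_sd = y.mean(), y.std()

t_std = (t - t_mu) / t_sd     # standardized times
y_std = (y - y_mu) / y_sd     # standardized values
yerr_std = yerr / y_sd        # measurement errors only scale with y

# ... fit the GP on (t_std, y_std, yerr_std) with "standard" priors ...
# A lengthscale recovered on the standardized axis corresponds to l_std * t_sd
# in the original time units; a GP amplitude corresponds to eta_std * y_sd.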

Also why take the square root of tau for the lengthscale? That’s just going to force tau to be larger than necessary, and sort of defeats the purpose of Betancourt’s principled InverseGamma prior.

I’m also suspicious that what you’re seeing is partly due to the way you are introducing noise into your simulations and specifying noise in your model. It’s not common to account for noise in the GP with a WhiteNoise kernel; typically you would specify a prior for the noise argument to gp.marginal_likelihood, e.g. σ = pm.Exponential('σ', lam=1) ... gp.marginal_likelihood('ml', X=X, y=y, noise=σ). Those may be equivalent formulations, but I’m not sure. I noticed that in your code earlier in the thread you passed yerr for this argument; that essentially acts as weights to the mean of the GP, and probably isn’t what you want. Additionally, you added noise_level to y in order to jitter those values, but you multiplied it by y to specify yerr, which doesn’t make sense to me.

You’ll want to specify a strictly positive prior for the GP variance (your sigma_drw, more commonly called eta (η) for a GP) as well as for the noise; for standardized data, Gamma(2,1) works well for the variance and Exponential(1) for the noise. Because you’re squaring it, a Normal prior will end up with a bimodal posterior around 0, which just makes it unnecessarily difficult for your sampler to converge.
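
Putting the noise and prior suggestions together, a minimal sketch (assuming standardized t_std/y_std as in the snippet above; the particular prior values are just the off-the-shelf choices mentioned here, not tuned to this problem):

import pymc3 as pm

with pm.Model() as model:
    # strictly positive priors for amplitude, lengthscale, and noise
    eta = pm.Gamma("eta", alpha=2, beta=1)            # GP standard deviation (the sigma_drw role)
    ell = pm.InverseGamma("ell", mu=1.0, sigma=0.5)   # lengthscale on the standardized axis (placeholder values)
    sigma = pm.Exponential("sigma", lam=1)            # observation noise

    cov = eta**2 * pm.gp.cov.Exponential(1, ell)
    gp = pm.gp.Marginal(cov_func=cov)

    # noise goes into marginal_likelihood rather than a separate WhiteNoise kernel
    y_ = gp.marginal_likelihood("y", X=t_std[:, None], y=y_std, noise=sigma)
    mp = pm.find_MAP()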


tl;dr: try scaling/standardizing your inputs and outputs, which should allow you to use more standard priors, and specify noise as noise, not a separate kernel.

Thanks for your reply @BioGoertz.

Standardizing the inputs seems like a good idea.

The motivation for parameterizing with tau and taking the square root for the lengthscale is so that the inferred timescale is in the same units as the input times. However, there is nothing wrong with adopting the standard priors on l and doing the conversion between tau and l afterwards. It also makes the analytic PSD easy to write down (see the link below if you are curious about this convention in my field).
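
For example, a tiny sketch of doing that conversion afterwards (using the 2l^2 = tau convention stated earlier; the prior values are placeholders):

# Inside the with pm.Model() block: put the prior on the lengthscale l and
# track tau = 2*l**2 as a derived quantity so it appears in the MAP dict / trace.
l = pm.InverseGamma("l", mu=1.0, sigma=0.5)   # placeholder prior on the (standardized) lengthscale
tau = pm.Deterministic("tau", 2 * l**2)
# cov = 2*sigma_drw**2 * pm.gp.cov.Exponential(1, l)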

The lengthscale is not completely unidentifiable/degenerate with the variance, although the covariance between the two parameters is strong. As my figure above shows, the MAP lengthscale is proportional to the simulated input lengthscale.

Thanks for noticing the noise typo and for the suggestions on the priors. I’ll keep looking into this.

Going to mark this as solved. Indeed, something was wrong with my priors initially. The MAP timescale still seems off by a factor of a few, but perhaps that is a bias in the MAP estimate. Also, once the timescale gets too long compared to the time series length (roughly input tau ≳ baseline/10; gray lines), the time series is no longer stationary over the baseline and the MAP timescales saturate, which is a well-known effect. I’m not sure if there is a good solution for this.
