# Diagnosing AR(1) model whose results are way off course

Hi all,
I’ve been comparing techniques for estimating the standard error of the mean (SEM) of highly autocorrelated data such as timeseries. Along the way I found some weird behaviour that has me questioning the parameterization, and I’d appreciate some help. PyMC3 models perform well at low autocorrelation, but at high autocorrelation the estimate of the mean is way off, whereas frequentist techniques fare much better. I’d expect the Bayesian approach to perform at least comparably. Code and a random seed to reproduce are below.


I’m using a stationary AR(1) model with a mean of zero, in which each measurement is the previous measurement multiplied by a parameter, `rho`, plus Gaussian noise. As an aside, the autocorrelated curve generated in the PyMC3 tutorial (https://docs.pymc.io/notebooks/AR.html) is biased, since it starts at exactly zero, unlike the function below: if the process had been initiated in the infinite past you would never see the origin. This simulated data uses a ‘run-up’ to avoid that:

``````
# Generating the correlated curve:
import numpy as np

np.random.seed(2)

def gen_correlated_curve(rho, num=100):
    num_with_runup = num + 5000  # run-up, so we never see the origin
    y = np.zeros((num_with_runup,))
    for i in range(1, num_with_runup):
        y[i] = rho * y[i - 1] + np.random.normal()
    return y[-num:]

timeseries = gen_correlated_curve(0.99, 3000)
print(timeseries.mean())
``````

out:

``````-2.7112278332640463
``````

So we expect the posterior for the mean to sit in the ballpark of the sample mean, -2.7, and hopefully the HPD encompasses the true mean of 0. The PyMC3 model has a ‘center’ parameter that shifts the timeseries back to zero, to account for cases where the timeseries is stationary but not centred on zero:

``````
import pymc3 as pm

with pm.Model() as ar1:
    rho_ = pm.Uniform('rho', -1, 1)  # we assume the process is stationary, so -1 < rho_ < 1
    tau_ = pm.Gamma('tau', mu=1, sd=1)  # precision of the innovations
    center = pm.Normal('center', mu=timeseries.mean(), sigma=5)  # prior for the true mean, centred on the sample mean
    likelihood = pm.AR1('likelihood', k=rho_, tau_e=tau_, observed=timeseries - center)
    trace = pm.sample()
``````

It seems to run without error, but the HPD for `center` is way off:

``````pm.stats.hpd(trace['center'])
``````

out:

``````array([-13.48658459,  -9.38930669])
``````

In case you think the data are just too difficult, that’s not the case: a frequentist CI built from the statistical inefficiency (equivalent to the correlation time here) easily captures zero.

``````
import statsmodels.api as sm

z95 = 1.959963984540054

# This is the Chodera lab 'statistical inefficiency' measure:
def statistical_inefficiency(corr, mintime=3):
    N = corr.size
    C_t = sm.tsa.stattools.acf(corr, fft=True, unbiased=True, nlags=N)
    t_grid = np.arange(N).astype('float')
    g_t = 2.0 * C_t * (1.0 - t_grid / float(N))
    ind = np.where((C_t <= 0) & (t_grid > mintime))[0][0]  # first lag where the ACF crosses zero
    g = 1.0 + g_t[1:ind].sum()
    return max(1.0, g)

# And these actually return the values for the SEM:
def sem_from_chodera(timeseries):
    autocorrelation_time = statistical_inefficiency(timeseries)
    n = len(timeseries)
    # calculate the SEM and return the CI:
    sem = np.std(timeseries) / np.sqrt(n / autocorrelation_time)
    return timeseries.mean() - sem * z95, timeseries.mean() + sem * z95

print(sem_from_chodera(timeseries))
``````

out:

``````(-5.771172458391055, 0.34871679186296234)
``````

So this CI captures the true mean of zero. This technique is from “Use of the Weighted Histogram Analysis Method for the Analysis of Simulated and Parallel Tempering Simulations” (Chodera et al., J. Chem. Theory Comput., 2007), but similar techniques for the ‘effective sample size’ used in MCMC (e.g. Sokal’s) also give good results, as does block averaging (sketched below), as does maximum likelihood combined with the analytical expression for correcting the SEM of a timeseries (I can post these too if you wish).
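For example, here’s a minimal sketch of the block-averaging version (the block size of 500 is an arbitrary illustrative choice; blocks should be long relative to the correlation time, which is roughly `(1 + rho) / (1 - rho) ≈ 200` for `rho = 0.99`):

``````
def sem_from_block_averaging(timeseries, block_size=500):
    # chop the series into non-overlapping blocks and take the mean of each
    n_blocks = len(timeseries) // block_size
    block_means = timeseries[:n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)
    # the block means are approximately independent, so the usual SEM formula applies to them
    sem = block_means.std(ddof=1) / np.sqrt(n_blocks)
    return timeseries.mean() - sem * z95, timeseries.mean() + sem * z95

print(sem_from_block_averaging(timeseries))
``````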

I’d really appreciate any thoughts whatsoever on why PyMC3 misses here, and thanks for your time!

OK, it looks like I’ve found a plausible explanation, and fixing it does solve this problem: the AR(1) boundary variance might be defined incorrectly.

The `logp` function of AR1 has a `boundary` term which gets applied to the first entry of the observed data:

``````
boundary = Normal.dist(0., tau=tau_e).logp
....
return boundary(x[0]) + tt.sum(innov_like)
``````

Here the boundary uses the precision of the innovation term. But the variance of an AR(1) process is not equal to the variance of the innovations: assuming the process was initiated in the infinite past, we can expect the first value to be consistent with the stationary variance of the whole AR(1) process. Taking the variance of both sides of `y[i] = rho * y[i-1] + noise` gives `var = rho**2 * var + var_innov`, so the stationary variance is:

``````
variance = var_innov / (1 - rho**2)  # = 1 / ((1 - rho**2) * tau_e)
``````

where `rho` is the autocorrelation parameter (called `k` in the AR1 specification) and `var_innov = 1 / tau_e` is the variance of the innovations.
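If you want to sanity-check that expression numerically, here’s a quick sketch reusing `gen_correlated_curve` from above (unit innovation variance, i.e. `tau_e = 1`):

``````
rho = 0.9
y = gen_correlated_curve(rho, num=200000)
print(y.var())             # empirical variance of the simulated process
print(1 / (1 - rho ** 2))  # stationary variance, ~5.26 for rho = 0.9
``````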

Changing the boundary to use the correct standard deviation, derived from that expression above, solves the issue. That is:

``````
## removed this:
# boundary = Normal.dist(0., tau=tau_e).logp
var_ar1 = 1 / ((1 - k**2) * tau_e)  # stationary variance of the AR(1) process
sd_ar1 = tt.sqrt(var_ar1)
boundary = pm.Normal.dist(0., sigma=sd_ar1).logp
``````
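To get a feel for the size of the error this fixes, here’s a quick sketch using `scipy.stats` (not used elsewhere in this post; the value of `x0` is just a typical first observation, about one stationary standard deviation from the mean):

``````
import numpy as np
from scipy import stats

rho = 0.99
x0 = 7.0  # a typical first observation at rho = 0.99 (stationary sd ≈ 7.1)
print(stats.norm(0, 1).logpdf(x0))                          # old boundary (innovation sd): ~ -25.4
print(stats.norm(0, 1 / np.sqrt(1 - rho ** 2)).logpdf(x0))  # fixed boundary (stationary sd): ~ -3.4
``````

With the innovation sd, the boundary term heavily penalizes any plausible first observation, which drags `center` toward `timeseries[0]` rather than toward the sample mean.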

Here’s the full distribution definition and model run; the resulting HPD for `center` correctly encompasses the true mean (zero):

``````
import theano.tensor as tt

class fixedAR1(pm.distributions.Continuous):
    """
    Autoregressive process with 1 lag.

    Parameters
    ----------
    k: tensor
        effect of lagged value on current value
    tau_e: tensor
        precision for innovations
    """

    def __init__(self, k, tau_e, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.k = k = tt.as_tensor_variable(k)
        self.tau_e = tau_e = tt.as_tensor_variable(tau_e)
        self.tau = tau_e * (1 - k ** 2)
        self.mode = tt.as_tensor_variable(0.)

    def logp(self, x):
        """
        Calculate log-probability of AR1 distribution at specified value.

        Parameters
        ----------
        x: numeric
            Value for which log-probability is calculated.

        Returns
        -------
        TensorVariable
        """
        k = self.k
        tau_e = self.tau_e

        x_im1 = x[:-1]
        x_i = x[1:]

        ## removed this:
        # boundary = Normal.dist(0., tau=tau_e).logp
        var_ar1 = 1 / ((1 - k ** 2) * tau_e)  # stationary variance of the AR(1) process
        sd_ar1 = tt.sqrt(var_ar1)
        boundary = pm.Normal.dist(0., sigma=sd_ar1).logp

        innov_like = pm.Normal.dist(k * x_im1, tau=tau_e).logp(x_i)
        return boundary(x[0]) + tt.sum(innov_like)

with pm.Model() as ar1:
    rho_ = pm.Uniform('rho', -1, 1)  # we assume the process is stationary, so -1 < rho_ < 1
    tau_ = pm.Gamma('tau', mu=1, sd=1)  # precision of the innovations
    center = pm.Normal('center', mu=timeseries.mean(), sigma=5)  # prior for the true mean, centred on the sample mean
    likelihood = fixedAR1('likelihood', k=rho_, tau_e=tau_, observed=timeseries - center)
    trace = pm.sample(target_accept=0.9)

print(pm.stats.hpd(trace['center']))
``````

Awesome! Thanks for the follow-up! Could you please send a Pull Request to fix it?

Will do. Are you happy with the terminology I used?

``````var_ar1 = 1 / ((1-k**2)*tau_e)
sd_ar1 = tt.sqrt(var_ar1)
boundary = pm.Normal.dist(0., sigma=sd_ar1).logp
``````

I think it is fine like this, but adding some comments will help.
