# Diagnosing AR(1) model whose results are way off course

Hi all,
I’ve been comparing techniques for estimating the standard error of the mean (SEM) of highly autocorrelated data such as timeseries. Along the way I found some weird behaviour that has me questioning the parameterization, and I’d appreciate some help. PyMC3 models perform well at low autocorrelation, but at high autocorrelation the estimate of the mean is way off, whereas frequentist techniques fare much better. I’d expect the Bayesian approach to perform at least comparably. Code and a random seed to reproduce are below.


I’m using a stationary AR(1) model with a mean of zero, in which each measurement is the previous measurement multiplied by a parameter, `rho`, plus Gaussian noise. As an aside, the autocorrelated curve generated in the PyMC3 tutorial (https://docs.pymc.io/notebooks/AR.html) is biased, since it starts at exactly zero, unlike the function below: if the process had been initiated in the infinite past you would never see the origin. This simulated data uses a ‘run-up’ to avoid that:

``````
# Generating the correlated curve:
import numpy as np

np.random.seed(2)

def gen_correlated_curve(rho, num=100):
    num_with_runup = num + 5000  # run-up, so we never see the origin
    y = np.zeros((num_with_runup,))
    for i in range(1, num_with_runup):
        y[i] = rho * y[i - 1] + np.random.normal()
    return y[-num:]

timeseries = gen_correlated_curve(0.99, 3000)
print(timeseries.mean())
``````

out:

``````-2.7112278332640463
``````

So we expect the posterior for the mean to sit in the ballpark of the sample mean, -2.7, and hopefully the HPD encompasses the true mean of 0. The PyMC3 model has a ‘center’ parameter that shifts the timeseries back to zero, to account for cases where the timeseries is stationary but not centred on zero:

``````
import pymc3 as pm

with pm.Model() as ar1:
    rho_ = pm.Uniform('rho', -1, 1)  # we assume the process is stationary, so -1 < rho_ < 1
    tau_ = pm.Gamma('tau', mu=1, sd=1)  # precision of the innovations
    center = pm.Normal('center', mu=timeseries.mean(), sigma=5)  # prior for the true mean, centred on the sample mean
    likelihood = pm.AR1('likelihood', k=rho_, tau_e=tau_, observed=timeseries - center)
    trace = pm.sample()
``````

It seems to run without error, but the HPD for `center` is way off:

``````pm.stats.hpd(trace['center'])
``````

out:

``````array([-13.48658459,  -9.38930669])
``````

In case you think the data are just too difficult, that’s not the case: a frequentist CI built from the statistical inefficiency (equivalent to the correlation time here) easily captures zero.

``````
import statsmodels.api as sm

z95 = 1.959963984540054

# This is the Chodera lab 'statistical inefficiency' measure:
def statistical_inefficiency(corr, mintime=3):
    N = corr.size
    C_t = sm.tsa.stattools.acf(corr, fft=True, unbiased=True, nlags=N)
    t_grid = np.arange(N).astype('float')
    g_t = 2.0 * C_t * (1.0 - t_grid / float(N))
    ind = np.where((C_t <= 0) & (t_grid > mintime))[0][0]  # first lag where the ACF crosses zero
    g = 1.0 + g_t[1:ind].sum()
    return max(1.0, g)

# And these actually return the values for the SEM:
def sem_from_chodera(timeseries):
    autocorrelation_time = statistical_inefficiency(timeseries)
    n = len(timeseries)
    # calculate the SEM and return the CI:
    sem = np.std(timeseries) / np.sqrt(n / autocorrelation_time)
    return timeseries.mean() - sem * z95, timeseries.mean() + sem * z95

print(sem_from_chodera(timeseries))
``````

out:

``````(-5.771172458391055, 0.34871679186296234)
``````

So this CI captures the true mean of zero. This technique is from “Use of the Weighted Histogram Analysis Method for the Analysis of Simulated and Parallel Tempering Simulations” (Chodera et al., J. Chem. Theory Comput., 2007), but similar techniques for the ‘effective sample size’ used in MCMC (e.g. Sokal’s) also give good results, as does block averaging (sketched below), as does maximum likelihood combined with the analytical expression for correcting the SEM of a timeseries (I can post these too if you wish).
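For example, here’s a minimal sketch of the block-averaging version (the block size of 500 is an arbitrary illustrative choice; blocks should be long relative to the correlation time, which is roughly `(1 + rho) / (1 - rho) ≈ 200` for `rho = 0.99`):

``````
def sem_from_block_averaging(timeseries, block_size=500):
    # chop the series into non-overlapping blocks and take the mean of each
    n_blocks = len(timeseries) // block_size
    block_means = timeseries[:n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)
    # the block means are approximately independent, so the usual SEM formula applies to them
    sem = block_means.std(ddof=1) / np.sqrt(n_blocks)
    return timeseries.mean() - sem * z95, timeseries.mean() + sem * z95

print(sem_from_block_averaging(timeseries))
``````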

I’d really appreciate any thoughts whatsoever on why PyMC3 misses here, and thanks for your time!

OK, it looks like I’ve found a plausible explanation, and fixing it does solve this problem: the AR(1) boundary variance might be defined incorrectly.

The `logp` function of AR1 has a `boundary` term which gets applied to the first entry of the observed data:

``````
boundary = Normal.dist(0., tau=tau_e).logp
....
return boundary(x[0]) + tt.sum(innov_like)
``````

Here the boundary uses the precision of the innovation term. But the variance of an AR(1) process is not equal to the variance of the innovations: assuming the process was initiated in the infinite past, we can expect the first value to be consistent with the stationary variance of the whole AR(1) process. Taking the variance of both sides of `y[i] = rho * y[i-1] + noise` gives `var = rho**2 * var + var_innov`, so the stationary variance is:

``````
variance = var_innov / (1 - rho**2)  # = 1 / ((1 - rho**2) * tau_e)
``````

where `rho` is the autocorrelation parameter (called `k` in the AR1 specification) and `var_innov = 1 / tau_e` is the variance of the innovations.
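If you want to sanity-check that expression numerically, here’s a quick sketch reusing `gen_correlated_curve` from above (unit innovation variance, i.e. `tau_e = 1`):

``````
rho = 0.9
y = gen_correlated_curve(rho, num=200000)
print(y.var())             # empirical variance of the simulated process
print(1 / (1 - rho ** 2))  # stationary variance, ~5.26 for rho = 0.9
``````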

Changing the boundary to use the correct standard deviation, derived from that expression above, solves the issue. That is:

``````
## removed this:
# boundary = Normal.dist(0., tau=tau_e).logp
var_ar1 = 1 / ((1 - k**2) * tau_e)  # stationary variance of the AR(1) process
sd_ar1 = tt.sqrt(var_ar1)
boundary = pm.Normal.dist(0., sigma=sd_ar1).logp
``````
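To get a feel for the size of the error this fixes, here’s a quick sketch using `scipy.stats` (not used elsewhere in this post; the value of `x0` is just a typical first observation, about one stationary standard deviation from the mean):

``````
import numpy as np
from scipy import stats

rho = 0.99
x0 = 7.0  # a typical first observation at rho = 0.99 (stationary sd ≈ 7.1)
print(stats.norm(0, 1).logpdf(x0))                          # old boundary (innovation sd): ~ -25.4
print(stats.norm(0, 1 / np.sqrt(1 - rho ** 2)).logpdf(x0))  # fixed boundary (stationary sd): ~ -3.4
``````

With the innovation sd, the boundary term heavily penalizes any plausible first observation, which drags `center` toward `timeseries[0]` rather than toward the sample mean.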

Here’s the full distribution definition and model run; the resulting HPD for `center` correctly encompasses the true mean (zero):

``````
import theano.tensor as tt

class fixedAR1(pm.distributions.Continuous):
    """
    Autoregressive process with 1 lag.

    Parameters
    ----------
    k: tensor
        effect of lagged value on current value
    tau_e: tensor
        precision for innovations
    """

    def __init__(self, k, tau_e, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.k = k = tt.as_tensor_variable(k)
        self.tau_e = tau_e = tt.as_tensor_variable(tau_e)
        self.tau = tau_e * (1 - k ** 2)
        self.mode = tt.as_tensor_variable(0.)

    def logp(self, x):
        """
        Calculate log-probability of AR1 distribution at specified value.

        Parameters
        ----------
        x: numeric
            Value for which log-probability is calculated.

        Returns
        -------
        TensorVariable
        """
        k = self.k
        tau_e = self.tau_e

        x_im1 = x[:-1]
        x_i = x[1:]

        ## removed this:
        # boundary = Normal.dist(0., tau=tau_e).logp
        var_ar1 = 1 / ((1 - k ** 2) * tau_e)  # stationary variance of the AR(1) process
        sd_ar1 = tt.sqrt(var_ar1)
        boundary = pm.Normal.dist(0., sigma=sd_ar1).logp

        innov_like = pm.Normal.dist(k * x_im1, tau=tau_e).logp(x_i)
        return boundary(x[0]) + tt.sum(innov_like)

with pm.Model() as ar1:
    rho_ = pm.Uniform('rho', -1, 1)  # we assume the process is stationary, so -1 < rho_ < 1
    tau_ = pm.Gamma('tau', mu=1, sd=1)  # precision of the innovations
    center = pm.Normal('center', mu=timeseries.mean(), sigma=5)  # prior for the true mean, centred on the sample mean
    likelihood = fixedAR1('likelihood', k=rho_, tau_e=tau_, observed=timeseries - center)
    trace = pm.sample(target_accept=0.9)

print(pm.stats.hpd(trace['center']))
``````

Awesome! Thanks for the follow-up! Could you please send a Pull Request to fix it?

Will do. Are you happy with the terminology I used?

``````var_ar1 = 1 / ((1-k**2)*tau_e)
sd_ar1 = tt.sqrt(var_ar1)
boundary = pm.Normal.dist(0., sigma=sd_ar1).logp
``````

I think it is fine like this, but adding some comments will help.
