Hi all,
I’ve been comparing techniques to estimate the standard error of the mean for highly autocorrelated data such as timeseries. In the meantime I found some weird behaviour that has me questioning the paramerization and I’d appreciate some help. PyMC3 models perform well at low autocorrelation but, at high autocorrelation, the estimation of the mean is way off, where frequentist techniques fare much better. I’d expect the bayesian approach to perform at least comparatively. I have code and a random seed to reproduce below, to see it go to:
.
I’m using a stationary AR(1) model with a mean of zero, in which each measurement is the previous measurement multiplied by some parameter, rho
, plus some noise. By the way, the autocorrelated curve that is generated in the tutorial is biased since it starts at zero (https://docs.pymc.io/notebooks/AR.html), unlike this function. If the process was initiated in the infinite past then you would never see the origin. This simulated data uses a ‘runup’ to avoid that:
####Generating correlated curve:
np.random.seed(2)
def gen_correlated_curve(rho, num=100):
num_with_runup = num+5000 #avoid seeing the origin
y = np.zeros((num_with_runup,))
for i in range(1,num_with_runup):
y[i] = rho * y[i-1] + np.random.normal()
return y[-num:]
timeseries = gen_correlated_curve(0.99, 3000)
print(timeseries.mean())
out:
-2.7112278332640463
So we’re looking for a mean in the ballpark of -2.7, and hopefully the HPD encompasses the true mean of 0. The PyMC3 model has a ‘center’ parameter that shifts the timeseries back to zero, to account for cases where the timeseries is stationary but not centred on zero:
with pm.Model() as ar1:
rho_ = pm.Uniform('rho',-1,1) #we assume process is stationary, so -1<k_<1
tau_ = pm.Gamma('tau',mu=1,sd=1) #precision of innovation
center = pm.Normal('center', mu=timeseries.mean(), sigma=5) #set the prior for the true mean to be initially centred on the population mean
likelihood = pm.AR1('likelihood', k=rho_, tau_e=tau_, observed=timeseries-center)
trace = pm.sample()
Seems to run without error. But the HPD for the center
is way off:
pm.stats.hpd(trace['center'])
out:
array([-13.48658459, -9.38930669])
In case you think the data are just difficult, that’s not the case. Calculating the statistical inefficiency (equivalent to the time constant in this case) captures zero in the CI easily.
z95 = 1.959963984540054
#this is the Chodera lab 'statistical inefficiency' measure.
def statistical_inefficiency(corr, mintime=3):
N = corr.size
C_t = sm.tsa.stattools.acf(corr, fft=True, unbiased=True, nlags=N)
t_grid = np.arange(N).astype('float')
g_t = 2.0 * C_t * (1.0 - t_grid / float(N))
ind = np.where((C_t <= 0) & (t_grid > mintime))[0][0]
g = 1.0 + g_t[1:ind].sum()
return max(1.0, g)
#and these actually return the values for SEM:
def sem_from_chodera(timeseries):
autocorrelation_time = statistical_inefficiency(timeseries)
n = len(timeseries)
#calculate SEM and return CI:
sem = np.std(timeseries) / np.sqrt(n/autocorrelation_time)
return timeseries.mean()-sem*z95, timeseries.mean()+sem*z95
print(sem_from_chodera(timeseries))
out:
(-5.771172458391055, 0.34871679186296234)
So this CI captures the true mean of zero. This technique is from Use of the Weighted Histogram Analysis Method for the Analysis of Simulated and Parallel Tempering Simulations
, but similar techniques for the ‘effective sample size’ used in MCMC also give good results (i.e. Sokal), as does block averaging, as does maximum likelihood in combination with the analytical expression for correcting the SEM of timeseries (I can post these too if you wish).
I’d really appreciate any thoughts whatsoever on why pymc3 misses, and thanks for your time!