I’m working on adding a Moving Average component to a time series model and I noticed that pymc.MA doesn’t exist yet, but pymc.AR does. Does it make sense to develop pymc.MA? Happy to do a pull request myself, just wanted to gauge interest first.
These days we can generate most time series with a CustomDist, so there’s not as much need to prebuild them: PyMC_timeseries_AR.ipynb · GitHub
There’s a more slow-paced example for an AR in the official examples, but it uses the same approach: Time Series Models Derived From a Generative Graph — PyMC example gallery
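To make that concrete, here’s a minimal sketch of the scan + CustomDist pattern those notebooks use, written for an AR(1). The priors, the length T, and the helper name ar1_dist are placeholders, and it assumes a recent PyMC where pm.pytensorf.collect_default_updates is available; an MA term could be written the same way, carrying the lagged innovation as the recursive state.

import pymc as pm
import pytensor
from pymc.pytensorf import collect_default_updates

T = 100  # number of AR steps to generate (arbitrary for this sketch)

def ar1_dist(x0, rho, sigma, size):
    # One step of the recursion: x_t ~ Normal(rho * x_{t-1}, sigma)
    def step(x_tm1, rho, sigma):
        x_t = pm.Normal.dist(mu=rho * x_tm1, sigma=sigma)
        return x_t, collect_default_updates([x_t])

    ar, _ = pytensor.scan(
        fn=step,
        outputs_info=[x0],
        non_sequences=[rho, sigma],
        n_steps=T,
        strict=True,
    )
    return ar

with pm.Model() as ar1_model:
    rho = pm.Normal("rho", 0, 0.5)
    sigma = pm.HalfNormal("sigma", 1)
    x0 = pm.Normal("x0", 0, 1)
    # PyMC derives the logp of the scan, so this works for prior sampling and inference
    ar1 = pm.CustomDist("ar1", x0, rho, sigma, dist=ar1_dist, shape=(T,))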
There’s also BayesianSARIMA in pymc-experimental
Is there a way to include the intercept? I’m trying a few different things to get the intercept to work, but no luck.
AR(1)

import pymc as pm
import pymc_experimental.statespace as pmss  # BayesianSARIMA lives in pymc-experimental

ss_mod = pmss.BayesianSARIMA(order=(1, 0, 0), verbose=True)
y2 = y.reshape(-1, 1)  # y is the observed series

with pm.Model(coords=ss_mod.coords) as sarima_model:
    intercept = pm.Normal("intercept", mu=0, tau=0.0001)
    state_sigmas = pm.Uniform("sigma_state", 0, 100, dims=ss_mod.param_dims['sigma_state'])
    ar_params = pm.Normal("ar_params", 0, tau=0.0001, dims=ss_mod.param_dims['ar_params'])

    # Build the state-space model
    ss_mod.build_statespace_graph(y2, mode="JAX")

    # Sampling
    trace = pm.sample(nuts_sampler='numpyro', target_accept=.99)
Do you mean a drift term? There’s not usually an “intercept” in SARIMA models. Does your data have a trend you are trying to capture? One typically differences that away by using, e.g., an ARIMA(1,1,0).
@jessegrabowski does that mean you’d model the trend on time and add an ARIMA(1, 1, 0) to the model to capture np.diff(time, 1)?
It depends what kind of trend you are dealing with. Suppose you have a time series with a gaussian random walk component, like:
\begin{align} x_t &= \alpha_t + \rho x_{t-1} + \epsilon_t\\ \alpha_t &= \alpha_{t-1} + \eta_t \end{align}
Where \epsilon_t, \eta_t are i.i.d. whatever. If you compute first differences, you get:
\begin{align} \Delta x_t &= x_t - x_{t-1} \\ \Delta x_t &= \alpha_t + \rho x_{t-1} + \epsilon_t - \alpha_{t-1} - \rho x_{t-2} - \epsilon_{t-1} \\ \Delta x_t &= (\alpha_t - \alpha_{t-1}) + \rho (x_{t-1} - x_{t-2}) + (\epsilon_t - \epsilon_{t-1}) \\ \Delta x_t &= \eta_t + \rho \Delta x_{t-1} + \Delta \epsilon_t \end{align}
You get a new AR(1) model in the differences, and the trend goes away (You can see this by just re-arranging the GRW equation). So this model is stationary – you don’t have any more trend to worry about (assuming | \rho | < 1).
Similar exercises show that differencing will wipe out a drift term (x_t = \alpha + \rho x_{t-1} + \epsilon_t; notice there’s no subscript t on the \alpha term) or a deterministic trend (x_t = \beta t + \rho x_{t-1} + \epsilon_t), or any combination of these.
So if you think your data has some simple trend structure like one of these three, you can just directly estimate an ARIMA(1,1,0) and wipe it out completely. You can also just model your data in first differences. The advantage of modeling with ARIMA(1,1,0) is that the trend will be automatically included in any forecasts (you’d have to add it back in yourself if you difference the data by hand first).
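If it helps, here’s a quick pure-NumPy simulation of exactly the GRW + AR(1) process above (parameter values arbitrary), showing that the level wanders while the first differences stay put:

import numpy as np

rng = np.random.default_rng(0)
T, rho = 500, 0.7

# alpha_t = alpha_{t-1} + eta_t  (Gaussian random walk)
alpha = np.cumsum(rng.normal(0, 0.5, size=T))

# x_t = alpha_t + rho * x_{t-1} + eps_t
x = np.zeros(T)
for t in range(1, T):
    x[t] = alpha[t] + rho * x[t - 1] + rng.normal(0, 1.0)

dx = np.diff(x)

# The level drifts along with alpha; the differences hover around a fixed mean
print(x[:100].mean(), x[-100:].mean())
print(dx[:100].mean(), dx[-100:].mean())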
I see, thanks @jessegrabowski !
Definitely sounds more practical and less error prone when doing OOS predictions. From that I understand that the statespace module does the differencing itself and the user doesn’t have to feed in the differenced data – correct?
And what if one needs a step linear trend (+ potential seasonality), like the GAM-AR model in this book? Can statespace handle that, or does the user need to difference the data by hand?
So GRW + AR??? I had never seen that. Does this have a technical name?
Yes, if you specify a difference like pmss.BayesianSARIMAX(order=(1,1,0)), then you will pass the raw, undifferenced data. The observed data will be modeled as a cumsum over differenced hidden states. The ARMA dynamics will enter via those hidden states. An ARIMA(1,1,0) looks something like:
\begin{bmatrix} x_t \\ \Delta x_t \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & \phi_1 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ \Delta x_{t-1} \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \varepsilon_t
The observed data will correspond to the x_t state. This is what I meant when I said it’s handled “automatically”. You also don’t lose an observation, as when you difference by hand.
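A toy NumPy simulation of that transition equation (coefficients arbitrary) makes the “cumsum over differenced hidden states” point explicit:

import numpy as np

rng = np.random.default_rng(0)
phi1, sigma = 0.5, 1.0

T_mat = np.array([[1.0, 1.0],
                  [0.0, phi1]])  # transition matrix from the equation above
R = np.array([0.0, 1.0])         # the innovation only enters the differenced state

states = np.zeros((200, 2))      # columns: x_t and delta x_t
for t in range(1, 200):
    states[t] = T_mat @ states[t - 1] + R * rng.normal(0, sigma)

x, dx = states[:, 0], states[:, 1]
# The observed state is the running sum of the differenced state (shifted one step)
print(np.allclose(x[1:], np.cumsum(dx[:-1])))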
I don’t super grok all the tfp code, but I believe this model would correspond to:
from pymc_experimental.statespace import structural as pmss
linear_trend = pmss.LevelTrendComponent(order=2, innovations_order=0)
ar = pmss.AutoregressiveComponent(order=1)
seasonal = pmss.FrequencySeasonality(season_length=12, n=2, innovations=False)
ss_mod = (linear_trend + ar + seasonal).build()
PyMC model:

import pymc as pm
import pandas as pd
import pytensor.tensor as pt
import numpy as np

dates = pd.date_range(start='2000-01-01', freq='MS', periods=12*24)
dummy_data = pd.Series(np.nan, index=dates)

with pm.Model(coords=ss_mod.coords) as mod:
    P0_diag = pm.Gamma('P0_diag', alpha=2, beta=1)
    P0 = pm.Deterministic('P0', pt.eye(ss_mod.k_states) * P0_diag, dims=['state', 'state_aux'])

    # This initial state is the slope of the intercept and trend
    initial_trend = pm.Normal('initial_trend', sigma=1e-3, dims=['trend_state'])

    ar_params = pm.Beta('ar_params',
                        **pm.find_constrained_prior(pm.Beta, lower=0.8, upper=0.99,
                                                    init_guess={'alpha': 10, 'beta': 1}),
                        dims=['ar_lag'])
    sigma_ar = pm.Exponential('sigma_ar', 1)

    # These initial states are the strength of the seasonal effects
    seasonality = pm.ZeroSumNormal('Frequency[s=12, n=2]', sigma=10, dims=['Frequency[s=12, n=2]_state'])

    ss_mod.build_statespace_graph(dummy_data, mode='JAX')
    prior = pm.sample_prior_predictive()

prior_sample = ss_mod.sample_unconditional_prior(prior)
component_idata = ss_mod.extract_components_from_idata(prior_sample)
Plotting:

import matplotlib.pyplot as plt

idx = np.random.choice(500)

fig, ax = plt.subplots(figsize=(14, 4), layout='constrained')
prior_sample.prior_observed.sel(chain=0, draw=idx, observed_state='data').plot.line(x='time', ax=ax)
ax.set_title('')

fig, ax = plt.subplots(4, 1, figsize=(14, 8), layout='constrained')
for axis, state in zip(fig.axes, component_idata.coords['state'].values):
    component_idata.prior_latent.sel(state=state, chain=0, draw=idx).plot.line(x='time', ax=axis)
    axis.set_title(state)
Here’s a draw from the prior:
And here’s how the latent states decompose:
It would be under the umbrella of “structural time series decomposition” I guess. I don’t think this specific combination has a special name. I chose it to show that even a complex trend can be wiped out by differencing.
These videos (part 1 and part 2) are a nice overview of the structural modeling approach, but they use MATLAB code, so bring holy water. At least it’s not STATA!
Damn, thanks for the level of detail @jessegrabowski !!
Yeah this is quite new to me (with GPs you tend to explicitly model it), which is why I’m asking lots of questions – still feels like magic, which I don’t like
I’m definitely happy that this is not getting rid of the trend, but modeling it
I think that’s right. pmss.LevelTrendComponent(order=2, innovations_order=0) is super cool! I wasn’t sure if that was imposing a global or local order 2 constraint, but since you’re commenting about the slope of the intercept and trend in the PyMC model, I’m guessing it’s indeed local.
The main thing I don’t get yet is P0 and P0_diag. I have to dig deeper into the statespace docs.
A couple miscellaneous questions, if I may:
- Any reason you’re choosing such a small sigma for the prior on initial_trend?
- The Beta on ar_params is to impose stationarity, right?
I wrote a lot of detail in the docstring, but basically you can think about the order as the number of derivatives of “position” you want in the trend. order=1 just includes the position of the system (the “level” – a constant mean). order=2 gives position + velocity (aka trend). order=3 gives position + velocity + acceleration (aka quadratic trend), and so on.
This is why the Level component is going up, and the trend component is flat. The trend is the velocity of the level, so it literally is the trend. The level is where all that velocity accumulates (and it’s the level that is ultimately added into the observed data)
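A tiny NumPy sketch of the standard local linear trend block (not the library’s internal code, values arbitrary) shows the same thing: the trend state is the velocity, and the level just accumulates it.

import numpy as np

T_mat = np.array([[1.0, 1.0],   # level_t = level_{t-1} + trend_{t-1}
                  [0.0, 1.0]])  # trend_t = trend_{t-1} (no innovations here)

state = np.array([0.0, 0.5])    # initial level = 0, initial trend (velocity) = 0.5
levels = []
for _ in range(10):
    levels.append(state[0])
    state = T_mat @ state

print(levels)  # the level climbs by 0.5 each step while the trend state stays flat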
P0 is your initial belief about the uncertainty of where all of these hidden states are located before you see the first datapoint. It is necessary because we’re doing iterative updating. Basically you have a belief about the current state of the system x_t \sim N(\hat x_t, P_t), and because everything is nice and Gaussian, we can update this belief when the next data point arrives (via the same conditioning logic that GPs use!). As data comes in we update both \hat x_t and P_t, so your initial choices of these aren’t super important. Even if you pick a ridiculous P_0, it will be washed away by the data in about 10-25 time steps.
What sucks is that if you do this so-called “diffuse initialization” (just set P0 = np.eye(n_state) * 1e9, for example, as statsmodels does), you will have to cut off the first few timesteps when you make output plots. This is because the uncertainty will be exploding really badly at first, then it settles down quickly. The data is also only very weakly informative about it, so I just try to pick something that sort of makes sense with the scale of the data for the level, then something close to the identity matrix for everything else.
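Here’s a scalar illustration of that washing-out, using a one-state local level Kalman filter in plain NumPy (q, r, and the P0 values are arbitrary):

import numpy as np

q, r = 0.1, 1.0  # state innovation variance and observation noise variance

def filtered_variance(P0, n_steps=25):
    P = P0
    for _ in range(n_steps):
        P_pred = P + q                          # predict step
        P = P_pred - P_pred**2 / (P_pred + r)   # update step after one observation
    return P

# Wildly different initial uncertainties end up at (almost) the same filtered variance
for P0 in [1e-4, 1.0, 1e9]:
    print(P0, filtered_variance(P0))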
So it’s basically a nuisance parameter. For stationary models like BayesianSARIMAX and BayesianVARMA, you can avoid it by computing the steady-state covariance implied by your transition model and just using that. For now we force you to pick it for non-stationary models, but that sucks and I want to figure out a better way.
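For reference, that steady-state covariance is the solution of the discrete Lyapunov equation P = T P T' + Q; here’s a hedged sketch for a stationary AR(1) using SciPy (not the statespace module’s own code):

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rho, sigma = 0.8, 1.0
T_mat = np.array([[rho]])       # transition matrix of a stationary AR(1)
Q = np.array([[sigma**2]])      # state innovation covariance

P_steady = solve_discrete_lyapunov(T_mat, Q)  # solves P = T P T' + Q
print(P_steady)  # equals sigma**2 / (1 - rho**2), about 2.78 here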
Any reason you’re choosing such a small sigma for the prior on initial_trend?
If the trend (velocity) is too large relative to the scale of the data, it will swamp out all the other dynamics in the system. I wanted the wiggles from the AR and seasonal components to be visible, so I set it small. You will typically find that the trend sigma needs to be quite small.
The Beta on ar_params is to impose stationarity, right?
Yep!
Ha ha yeah, I’m literally reading these and the pymcX examples these days. Your explanation here is already super helpful, but I’ll let you know if I come up with more questions
Ah that’s good to know! Although if your data is annual, 10-25 steps can be a lot.
All of this makes sense. Would definitely be great to abstract that out from the API in a later iteration, but I don’t think it’s insurmountable – I just need to read the theory / docs a few more times. The videos you shared are helpful BTW, thanks.
I wasn’t aware of that at all, really good to know, thanks Jesse!