State Space Models in PyMC

In general your model should make use of all the information in the observed data. It sounds like you are mixing the concerns of parameter inference and forecasting, which you could separate by fitting on a smaller dataset and assessing coverage on the remaining data.

But other than that, there’s no reason why you wouldn’t use all the observed past and future information when doing parameter inference.

Overfitting in a Bayesian context would be addressed by stronger regularizing priors or different model structures, not by ignoring existing data.

Fair enough that I am mixing together two issues: inference and forecasting. When forecasting, of course, all data is used, because the information set for out-of-sample data is all of the data. Both methods, the Kalman filter and the GRW, should produce the same result; I will test this when I have time. In addition, outputs based on the Kalman smoothing distribution should be very close to the GRW results – I observed something like that near the beginning of this thread. So you get expanding-window train-test split results for free with the Kalman filter: no future information is used in the forward pass, then full-information estimation happens in the backward pass.

But there is still an issue with respect to inference, and there are reasons why you wouldn’t use future information when doing parameter inference. It depends on the perspective you want to take in your analysis: that of yourself, looking back from the future, or that of a historical agent who is generating the data. In the second context you aren’t ignoring data; you are handling the data in a manner that respects the data-generating process. If you want to back-test a trading strategy based on market beta, for example, you need to use market-beta estimates that only use information available up to time t, i.e. the Kalman-filtered estimates. Otherwise your estimates will be “overfit” to the history that actually occurred, instead of the myriad histories that might have unfolded given information at time t. This is the sense in which I meant to use the term “overfit”.

These types of situations come up often in macroeconomics and finance. Again, my thinking on the issue is biased because that’s where I work. I can imagine there are similar concerns in other social sciences – election outcome prediction perhaps? – but I don’t have any expertise there so I couldn’t say.


This is partly due to my priors. Try changing them to:

sigma_trend = pm.InverseGamma('sigma_trend', mu=0.25, sd=0.2)
sigma_mu = pm.InverseGamma('sigma_mu', mu=0.8, sd=0.25)
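For reference, these mu/sd arguments get converted to the distribution’s (alpha, beta) parameters. A quick sketch of what those priors imply, using the standard inverse-gamma moment formulas (not PyMC’s actual internals):

```python
import math

def invgamma_mu_sd_to_alpha_beta(mu, sd):
    """Map the (mean, sd) of an inverse-gamma to its (alpha, beta) parameters."""
    alpha = (mu / sd) ** 2 + 2.0   # from var = mu**2 / (alpha - 2)
    beta = mu * (alpha - 1.0)      # from mean = beta / (alpha - 1)
    return alpha, beta

alpha, beta = invgamma_mu_sd_to_alpha_beta(0.25, 0.2)

# Round-trip: the implied mean and sd should match what we asked for.
mean = beta / (alpha - 1.0)
sd = math.sqrt(mean ** 2 / (alpha - 2.0))
```

A quick sanity check like this is useful before sampling, since a prior with alpha close to 2 has infinite variance.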

If you are using the Kalman-smoothed estimate fit on data up to time T, and then using that to predict T+1, I don’t think it will lead to model overfitting. If you are using the Kalman-smoothed estimate fit on data up to time T to predict T-X, then yes, I agree it could lead to overfitting.

That is definitely a valid reason to avoid the Kalman-smoothed estimate.

AR models, yes; MA models require scan. See here and Brandon’s response.


Hi, this is an example of using scan for an AR1-MA1 model with pymc3.

@DanWeitzenfeld @brandonwillard , could you elaborate more on why MA models require scan, while AR models do not?

The AR(p) model is given by the equation

y_t = \sum_{i=1}^p \phi_i y_{t-i} + \varepsilon_t

And MA(q) refers to the moving average model of order q:

y_t = \mu + \sum_{i=1}^q \theta_i \varepsilon_{t-i} + \varepsilon_{t}

It seems to me AR models also need scan when p >= 2.

In case of GaussianRandomWalk (GRW):

y_t = \mu_t
\mu_t = \mu_{t-1} + \eta_{t}

So the \mu_t here is similar to AR(p=1). In this case, we can use either at.cumsum or np.cumsum (as in here) to replace scan. But we cannot use cumsum for a case like this one, right?

y_t = \mu_t
\mu_t = \phi_1 \mu_{t-1} + \phi_2 \mu_{t-2} + \eta_t
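The cumsum trick for the GRW case is easy to verify with a small NumPy sketch (hypothetical simulated data, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = rng.normal(size=100)  # innovations eta_t

# Loop version of the GRW: mu_t = mu_{t-1} + eta_t, with mu_0 = eta_0.
mu_loop = np.empty_like(eta)
mu_loop[0] = eta[0]
for t in range(1, len(eta)):
    mu_loop[t] = mu_loop[t - 1] + eta[t]

# Vectorized version: the random walk is just a cumulative sum of the innovations.
mu_cumsum = np.cumsum(eta)
```

The AR(2) recursion above cannot be reduced to a cumsum this way, because each step mixes two different lags with different weights.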

Thank you.

MA models need scan for the same reason you can represent an AR model with simple OLS but not an MA model: you need access to the error estimates \hat{\varepsilon}_t at every time step, which isn’t possible unless you iteratively compute the estimates \hat{y}_t (\mu_t in your nomenclature).
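A sketch of why the MA residuals are inherently sequential (hypothetical simulated data, assuming a zero initial error):

```python
import numpy as np

rng = np.random.default_rng(1)
T, mu, theta = 200, 0.5, 0.7

# Simulate an MA(1): y_t = mu + theta * eps_{t-1} + eps_t, with eps_{-1} = 0.
eps_true = rng.normal(size=T)
y = mu + eps_true + theta * np.concatenate([[0.0], eps_true[:-1]])

# Recovering the errors requires a loop: each residual depends on the previous one.
eps_hat = np.empty(T)
eps_hat[0] = y[0] - mu
for t in range(1, T):
    eps_hat[t] = y[t] - mu - theta * eps_hat[t - 1]
```

There is no slicing of y alone that produces eps_hat, which is exactly what scan provides in the symbolic graph.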

I assume higher-order AR processes should be possible without scan by just using clever slicing, something like:

mu_t = pm.Normal('hidden state', mu=0, sigma=1, shape=T)
phi = pm.Normal('AR params', mu=0, sigma=1, shape=2)
lag1 = at.concatenate([at.zeros(1), mu_t[:-1]])  # mu_t shifted back one step, zero-padded
mu_hat = (mu_t * phi[0] + lag1 * phi[1]).cumsum()

You would need to work out the details of the padding. You would also need to write a transformation constraining samples of the phis to the stationary region; otherwise this will easily blow up the sampler for long time series or higher-order processes.
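One standard way to at least check (rather than enforce) stationarity is via the eigenvalues of the companion matrix; a small NumPy sketch for an AR(p):

```python
import numpy as np

def is_stationary(phi):
    """An AR(p) is stationary iff all eigenvalues of its companion matrix
    lie strictly inside the unit circle."""
    p = len(phi)
    companion = np.zeros((p, p))
    companion[0, :] = phi          # first row holds the AR coefficients
    companion[1:, :-1] = np.eye(p - 1)  # sub-diagonal shifts the state
    return bool(np.all(np.abs(np.linalg.eigvals(companion)) < 1.0))
```

Turning this check into a sampler-friendly transformation is harder; rejecting draws outside the region is the crude fallback.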


Because the AR model can be vectorized, but the MA cannot. Intuitively: in the AR model, you can calculate the values of \varepsilon_t in one fell swoop, no matter what the order is, e.g. for order 3:

epsilon = y[3:] - (phi_1 * y[2:-1] + phi_2 * y[1:-2] + phi_3 * y[0:-3])

(Before the recent refactor, the AR model in pymc3 didn’t use scan. )

Try to vectorize the MA model in this way and you’ll see it’s impossible.
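The vectorized AR computation can be verified against a naive loop (a hypothetical NumPy sketch for order 3; note that phi_i pairs with lag i, per the AR(p) equation earlier in the thread):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=50)
phi_1, phi_2, phi_3 = 0.4, -0.2, 0.1

# Vectorized: y[2:-1] is lag 1, y[1:-2] is lag 2, y[0:-3] is lag 3.
eps_vec = y[3:] - (phi_1 * y[2:-1] + phi_2 * y[1:-2] + phi_3 * y[0:-3])

# Loop version for comparison.
eps_loop = np.array([
    y[t] - (phi_1 * y[t - 1] + phi_2 * y[t - 2] + phi_3 * y[t - 3])
    for t in range(3, len(y))
])
```

The same trick fails for MA because the right-hand side would need epsilon itself, not lags of y.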

Check out the logp method: you’ll see it’s similar to what I have above, but order 1 and with no phi.


The AR still doesn’t use scan in the logp, just for the forward random sampling. Otherwise you are absolutely correct!
