Indexing of Posterior Predictions in State Space Model

I’ve been playing around with the PyMCStateSpace class in pymc-experimental. Today I’m attempting to recreate the following figure from Commandeur & Koopman, page 24:

I have used the following code to fit the model with PyMC:

import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pytensor
import pytensor.tensor as pt
from pymc_experimental.statespace import structural as st
import pymc_experimental.statespace 

# Load and clean the data
dat = pd.read_csv("data/ukdriversksi.txt", skiprows=1, header=None, names=['y'])
date_range = pd.date_range(start='1969-01-01', periods=192, freq='MS')
dat['log_y'] = np.log(dat['y'])
dat['date'] = date_range

# Set up the matrixes
ll = st.LevelTrendComponent(order=2)
ss_mod = ll.build()

# Declare the model and samplle
with pm.Model(coords=ss_mod.coords) as model:
    P0_diag = pm.Gamma("P0_diag", alpha=2, beta=1, dims=['state'])
    P0 = pm.Deterministic("P0", pt.diag(P0_diag), dims=['state', 'state_aux'])
    
    initial_trend = pm.Normal("initial_trend", sigma=[10, 10], dims=['trend_state'])
    sigma_trend = pm.Gamma("sigma_trend", alpha=2, beta=1, dims=['trend_shock'])

    ss_mod.build_statespace_graph(dat['log_y'], mode="JAX")
    idata = pm.sample(nuts_sampler="numpyro")

# Generate posterior predictive samples
with model:
     pm.sample_posterior_predictive(idata, extend_inferencedata=True)

# Convert to a pandas df for my sanity
posterior_predictive_df = idata.posterior_predictive.to_dataframe().reset_index()

# Create a 'date' column to match the dataset
posterior_predictive_df['date'] = posterior_predictive_df['time'].apply(lambda x: date_range[x])

# Compute the grouped mean value of 'obs'
posterior_predictive_df['post_pred_mean'] = posterior_predictive_df.groupby('date')['obs'].transform('mean')

# Keep only unique values
posterior_predictive_df = posterior_predictive_df.groupby('date')['post_pred_mean'].agg('mean').reset_index()

# Janky manual shift
# posterior_predictive_df['post_pred_mean'] = posterior_predictive_df['post_pred_mean'].shift(-1)

# Plot
plt.figure(figsize=(14, 7))
plt.plot(posterior_predictive_df['date'], posterior_predictive_df['post_pred_mean'], label='Posterior predictive mean', color='red')
plt.plot(dat['date'], dat['log_y'], label='Observed data', color='blue')

The posterior predictions seem to lead the observed timepoints by 1. It seems like this could be due to:

  1. Misunderstanding on my part of how indexing works with these models;
  2. Error in how I’ve specified the model;
  3. Error in my data munging.

Thanks in advance for your help: probably this is just a small thing, but I’m new to PyMC and have gotten stumped on how to debug. Hoping to get to the bottom of it before I try to fit more complex models from that book.

I think you’re getting the shifting because the posterior predictive appends the estimated initial state to the series. So it starts one timestep before the data starts – do you actually need to shift it backwards in that case. But that said: (Wrong and dumb, see below)

  1. If you set a DateTimeIndex on the data you pass tobuild_statespace_graph, the model will automatically store the dates and propagate them to all of the outputs. I wrote all this because reasoning about adding extra initial steps or not adding extra initial steps is maddening, and I don’t wish it on anymore.

  2. Sampling the posterior predictive is non-trivial, so there are a bunch of helper functions there for you. Do you want the conditional posterior or the uncoditional? The smoothed, filtered or predicted? The hidden states or the observed state? Should measurement error be included? etc etc etc.

    In particular, you can use ss_mod.sample_conditional_posterior, which will give you outputs of the kalman filter (predicted, filtered, and smoothed) using the posterior parameter draws in idata. This means each timestep is conditioned on the data up to (predicted), including (filtered), or given everything beyond (smoothed) the data at time t.

So you can look at the outputs like this:

post = ss_mod.sample_conditional_posterior(idata)

fig, ax = plt.subplots(figsize=(14,4), dpi=144)
x_grid = dat.index

mu = post.predicted_posterior_observed.mean(dim=['chain', 'draw']).values
hdi = az.hdi(post.predicted_posterior_observed).predicted_posterior_observed

ax.plot(x_grid[1:], mu[1:])
ax.fill_between(x_grid[1:], *hdi.values.squeeze()[1:].T, alpha=0.25)
dat.log_y.iloc[1:].plot(ax=ax)

I sliced away the first observation because the predicted value was zero for all draws. That might be a bug, I need to check. It should be equal to the initial states, which are not zero.

Also, I had to play around with the priors a bit, especially the initial acceleration (the 2nd value of initial_trend). A value of 10 is saying that time series could suddenly accelerate by as much as 20, which is more than the maximum range of the data. You’ll still learn the correct mean in this case, but it will cause your credible intervals to blow up.

This problem is worse in the one-step predictions than in the kalman filter or smoother, but we don’t really want to look at those outputs, because when there’s no measurement error they just noiselessly encode the observed data. Those are only interesting for looking at the hidden states (e.g. acceleration)

1 Like

Oh, I should mention that some degree of “seeing double” when you look at the in-sample plots is correct. Here’s the same model in statsmodels:

import statsmodels.api as sm

mod = sm.tsa.UnobservedComponents(dat['log_y'], level=True, trend=True, 
                                  stochastic_level=True, stochastic_trend=True,
                                  irregular=False)
res = mod.fit()

fig, ax = plt.subplots(figsize=(14,4), dpi=144)
res.predict()[1:].plot(ax=ax)
dat.log_y.iloc[1:].plot(ax=ax)

Why does this happen? Each blue value is the one-step ahead forecast from the last orange point, and the local level model only knows how to continue dynamics seen up until that point. So when the acceleration is positive, the one-step ahead is just where the last data point was, increased. When acceleration is negative, same. Or flat. But it can never guess that the data will reverse – it’s not in the models’ nature. So in this case where we have this spikey seasonal patter, the local level model will always be chasing the last value, and you’ll get this nice 1980’s 3D glasses effect.

I think your original plot was probably correct, but I won’t edit my post above because I still think working with dates is maddening and you should leave it to the program to do automatically for you.

Edit: A good sanity check if to plot the filtered values. These should line up perfectly. Back to the PyMC model:

fig, ax = plt.subplots(figsize=(14,4), dpi=144)
x_grid = dat.index

mu = post.filtered_posterior_observed.mean(dim=['chain', 'draw']).values
hdi = az.hdi(post.filtered_posterior_observed).filtered_posterior_observed

ax.plot(x_grid[1:], mu[1:], alpha=0.5, color='tab:blue')
ax.fill_between(x_grid[1:], *hdi.values.squeeze()[1:].T, alpha=0.25)
dat.log_y.iloc[1:].plot(ax=ax, alpha=0.5, color='tab:red')

1 Like