Implementing late-entering series in a PyMC state-space model

Hi all,

I am trying to implement a multivariate state-space model in PyMC for compositional time-series data.

Suppose I have multiple related time series observed over a common time index, but some series only begin partway through the sample. For example:

  • Series A and B are observed from t=1
  • Series C only starts at t=50

Before t=50, Series C is genuinely unobserved/nonexistent.

The paper I am reading proposes the following: the ‘errors’ for these run-in periods are forced to zero using the formula

e_{it} = \begin{cases} y_{it} - \hat{y}_{it}, & \text{if series } i \text{ is observed at time } t \\ 0, & \text{otherwise} \end{cases}

The idea is that the state-space recursion can still be written using the full dimensionality of y_{t}, while pre-entry observations remain np.nan.
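To make the formula concrete, here is a minimal NumPy sketch of the masked errors and of a likelihood that only sums over observed entries. The function names (`masked_errors`, `masked_gaussian_loglik`) are made up for illustration, and the independent Gaussian error with scale `sigma` is an assumption, not something taken from the paper.

```python
import numpy as np

def masked_errors(y, y_hat):
    """Return e_it = y_it - y_hat_it where y_it is observed, and 0 where it is NaN."""
    observed = ~np.isnan(y)
    # np.where evaluates both branches; the NaN differences are computed but discarded
    e = np.where(observed, y - y_hat, 0.0)
    return e, observed

def masked_gaussian_loglik(y, y_hat, sigma):
    """Sum independent Gaussian log-densities over observed entries only.

    Assumption: errors are independent N(0, sigma^2); sigma may be a scalar
    or anything that broadcasts against y (e.g. one value per series).
    """
    e, observed = masked_errors(y, y_hat)
    sigma = np.broadcast_to(sigma, y.shape)
    ll = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (e / sigma) ** 2
    # Pre-entry cells are dropped here, so they add nothing to the objective
    return np.sum(ll[observed])
```

With `y` of shape (T, n_series) and NaN in the pre-entry cells, the zeroed errors contribute nothing to an error-driven state update, and the mask keeps those cells out of the log-likelihood.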

What I am unsure about is how to handle these missing/pre-entry error components in PyMC so that they do not update the state and do not enter the observed-data likelihood/objective.

What would be the cleanest way to implement this in PyMC? Any guidance or example patterns would be greatly appreciated.

There is a public implementation available on GitHub which is not built on PyMC. Admittedly, I haven’t spent much time on this problem and know very little about PyMC, but looking at that codebase, I am trying to find an equivalent for this code in PyMC.

Have you tried the pymc-extras statespace module? It handles missing data out of the box.

Link to an example use: pymc-extras/notebooks/Structural Timeseries Modeling.ipynb (main branch of pymc-devs/pymc-extras on GitHub)

From a quick skim, the examples there don’t have missing data, but it’s just a question of passing nan

it’s just a question of passing nan

And making sure they are ignored during likelihood calculation instead of imputing them. How do I make sure they are ignored and not imputed? I did find this comment from another thread.

If you look here, they filter out the rows and columns of the covariance matrix wherever there are NaN values. I was wondering whether, and how, I can reproduce this in PyMC.
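For reference, the row/column filtering described above can be sketched in plain NumPy as a single Kalman update that uses only the non-NaN entries of the observation vector. This is only an illustration of the technique, not the code from the linked comment and not the pymc-extras implementation; the function name and argument names (`a`, `P`, `Z`, `H`) are my own choices.

```python
import numpy as np

def kalman_update_observed_only(a, P, y, Z, H):
    """One Kalman update using only the observed (non-NaN) entries of y.

    a: predicted state mean, shape (m,)
    P: predicted state covariance, shape (m, m)
    y: observation vector with NaN for missing entries, shape (p,)
    Z: design matrix, shape (p, m)
    H: observation noise covariance, shape (p, p)
    Returns the updated mean, updated covariance and log-likelihood contribution.
    """
    obs = ~np.isnan(y)
    if not obs.any():
        # Nothing observed: the state is not updated and the likelihood is unchanged
        return a, P, 0.0

    # Keep only the rows (and rows/columns of H) that belong to observed series
    y_o = y[obs]
    Z_o = Z[obs, :]
    H_o = H[np.ix_(obs, obs)]

    v = y_o - Z_o @ a                     # innovation for observed series only
    F = Z_o @ P @ Z_o.T + H_o             # innovation covariance
    K = np.linalg.solve(F, Z_o @ P).T     # Kalman gain P Z' F^{-1} (F and P symmetric)

    a_new = a + K @ v
    P_new = P - K @ Z_o @ P

    _, logdet = np.linalg.slogdet(F)
    ll = -0.5 * (obs.sum() * np.log(2 * np.pi) + logdet + v @ np.linalg.solve(F, v))
    return a_new, P_new, ll
```

Because the missing rows are dropped before the innovation is formed, they neither move the state nor add a term to the log-likelihood, which is the "ignored, not imputed" behaviour I am after.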

What’s the concern if they are imputed?

Anyway, you can model each series separately, starting each one when it really starts, and share the structural parameters the same way you would for series of the same length. You don’t need a single likelihood.
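A rough NumPy sketch of that idea, assuming a simple local level model per series (my choice for illustration, not necessarily the model in question): each series is trimmed to its own start, gets its own likelihood term, and all terms share `sigma_level` and `sigma_obs`. The helper names are hypothetical. In PyMC the analogue would be one observed variable per series, all built from the same parameter variables, since the model log-probability is the sum over observed variables.

```python
import numpy as np

def trim_leading_nans(y):
    """Drop the pre-entry NaNs at the start of a series (assumes no interior gaps)."""
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    if not observed.any():
        return y[:0]
    return y[np.argmax(observed):]

def local_level_loglik(y, sigma_level, sigma_obs, a0=0.0, p0=1e6):
    """Kalman-filter log-likelihood of one fully observed series (local level model)."""
    a, p, ll = a0, p0, 0.0
    for y_t in y:
        f = p + sigma_obs**2                      # one-step prediction variance
        v = y_t - a                               # one-step prediction error
        ll += -0.5 * (np.log(2 * np.pi * f) + v**2 / f)
        k = p / f                                 # Kalman gain
        a = a + k * v                             # filtered level
        p = p * (1 - k) + sigma_level**2          # predicted variance for the next step
    return ll

def joint_loglik(series, sigma_level, sigma_obs):
    """Sum per-series log-likelihoods; the two scale parameters are shared."""
    return sum(
        local_level_loglik(trim_leading_nans(y), sigma_level, sigma_obs)
        for y in series.values()
    )
```

For example, `joint_loglik({"A": y_a, "C": y_c}, 0.1, 0.5)` scores series C only from its first observed time point onward, so its pre-entry period never enters the objective.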

This is precisely what we do here
