Hi all,
I am trying to implement a multivariate state-space model in PyMC for compositional time-series data.
Suppose I have multiple related time series observed over a common time index, but some series only begin partway through the sample. For example:
- Series A and B are observed from t=1
- Series C only starts at t=50
Before t=50, Series C is genuinely unobserved/nonexistent.
The paper I am reading proposes the following: the ‘errors’ for these run-in periods are forced to zero using the formula

$$
e_{it} =
\begin{cases}
y_{it} - \hat{y}_{it}, & \text{if series } i \text{ is observed at time } t\\
0, & \text{otherwise}
\end{cases}
$$
The idea is that the state-space recursion can still be written using the full dimensionality of y_{t}, while pre-entry observations remain np.nan.
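To make the formula concrete, here is a minimal NumPy sketch of the masked error, assuming pre-entry observations are stored as `np.nan` in a `(T, n_series)` array. The function name `masked_errors` and the toy numbers are hypothetical, not taken from the paper or from any library.

```python
import numpy as np


def masked_errors(y, y_hat):
    """Return e = y - y_hat where y is observed, and 0 where y is NaN."""
    observed = ~np.isnan(y)
    # np.where picks 0.0 wherever the series has not entered yet,
    # so the NaN produced by (y - y_hat) there is never used.
    errors = np.where(observed, y - y_hat, 0.0)
    return errors, observed


# Toy data: two time steps, three series; series C is missing at the first step.
y = np.array([[1.0, 2.0, np.nan],
              [1.5, 2.5, 3.0]])
y_hat = np.ones_like(y)

errors, observed = masked_errors(y, y_hat)
```

The `observed` mask is returned as well because the likelihood should only sum over entries where it is `True`; a zero error alone would still contribute a normalising term if it were scored as data.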
What I am unsure about is how to handle these missing/pre-entry error components in PyMC so that they do not update the state and do not enter the observed-data likelihood/objective.
What would be the cleanest way to implement this in PyMC? Any guidance or example patterns would be greatly appreciated.
There is a public implementation available on GitHub which is not built on PyMC. Admittedly, I haven’t spent a lot of time on this problem and I also know very little about PyMC. But looking at that codebase, I am trying to find a similar alternative for this code in PyMC.
Have you tried the pymc-extras statespace module? It handles missing data out of the box.
Links to some uses: pymc-extras/notebooks/Structural Timeseries Modeling.ipynb at main · pymc-devs/pymc-extras · GitHub
From a quick skim, they don’t include missing data, but it’s just a question of passing nan
> it’s just a question of passing nan
And making sure they are ignored during likelihood calculation instead of imputing them. How do I make sure they are ignored and not imputed? I did find this comment from another thread.
If you look here, they filter out the rows and columns of the covariance matrix wherever there are nan values. I was wondering whether, and how, I can reproduce this in PyMC.
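For reference, the row/column filtering described above can be sketched in plain NumPy as a single Kalman measurement update. This is a generic textbook-style update written for illustration, not the code from the linked implementation or from pymc-extras; the function name `kalman_update_missing` and the argument names are assumptions.

```python
import numpy as np


def kalman_update_missing(a, P, y, Z, H):
    """One Kalman measurement update that drops NaN entries of y.

    a: predicted state mean (m,),  P: predicted state covariance (m, m)
    y: observation (p,), may contain NaN
    Z: design matrix (p, m),       H: observation covariance (p, p)
    Returns the updated mean, updated covariance and this step's log-likelihood.
    """
    obs = ~np.isnan(y)
    if not obs.any():
        # Nothing observed: the state is not updated and nothing is added
        # to the likelihood.
        return a, P, 0.0

    Z_o = Z[obs]                  # keep only rows of observed series
    H_o = H[np.ix_(obs, obs)]     # keep matching rows and columns
    v = y[obs] - Z_o @ a          # innovation for the observed series only
    F = Z_o @ P @ Z_o.T + H_o     # innovation covariance
    K = np.linalg.solve(F, Z_o @ P).T   # Kalman gain P Z' F^{-1} (F, P symmetric)

    a_new = a + K @ v
    P_new = P - K @ Z_o @ P
    _, logdet = np.linalg.slogdet(F)
    loglik = -0.5 * (obs.sum() * np.log(2 * np.pi) + logdet
                     + v @ np.linalg.solve(F, v))
    return a_new, P_new, loglik


# Toy example: three series, the third one has not entered yet.
a0, P0 = np.zeros(2), np.eye(2)
Z = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
H = 0.5 * np.eye(3)
y_t = np.array([1.0, -1.0, np.nan])

a1, P1, ll1 = kalman_update_missing(a0, P0, y_t, Z, H)
```

Dropping the rows of the missing series gives the same result as running the update on the two-series system alone, which is the sense in which the missing component neither updates the state nor enters the likelihood.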
What’s the concern if they are imputed?
Anyway, you can model each series separately, starting when that series really starts, and just share the structural parameters the same way you would for series of the same length. You don’t need a single likelihood.
This is precisely what we do here
I think I’m just repeating what @ricardoV94 said above in more words.
With PyMC, you can just code series C starting at t = 50 without any dummy entries or zero errors. You can even offset indices by 50 so you don’t have to pad data with NaN values.
If before t = 50, series C is genuinely non-existent rather than unknown, it doesn’t make sense to write the state-space recursion in (A, B, C) for t in 0:50 or to impute values for it at those times. If values for t in 0:50 are just unobserved in series C, you can impute the first 50 values without playing any games with the errors and then just throw away the imputations if you don’t care about them. You can also add some weakly informative priors to the imputed values.