Multivariate time series inference

Hello everyone,

In biomedical fields (and in most scientific fields I suppose), the majority of the data are time series. However, most of the statistical methods used reduce time series into a single, arbitrary, data point (such as mean, median or maximum). To avoid this considerable loss of information, some techniques exist (such as spm1d), but they are not well known and are based in the frequentist framework.

I have already successfully applied Bayesian models for scalar datasets, however I don’t really know how to formulate a time related model.

Consider the following example: participants from two populations (A and B) perform six tests of an experimental task generating three variables (x, y and z) of size 100 each (say 100 seconds). You want to obtain posterior distributions that consider the time component and co-variances of the variables to make inferences (e.g. how population A differs from population B). How would you formulate such model?

Did you have a look at MvGaussianRandomWalk? I would also try a Multivariate auto-regressive model.

This is probably what you’re looking for. In the econometrics literature, what you describe (multiple variables [including population], sequential data) is typically called ‘panel data’. There’s a pymc3-specific hit for a google search here. In general the model is given an additional index (over time), allowing for lag-dependences, such as (for example, individual i at time t):

result_{i,t} ~ task_offset_{i} + population_offset_{i,t} + lag1_effect * result_{i, t-1} + lag2_effect * result_{i, t-2} + ... + error_{i, t}.

This either means modeling the response as a matrix (not all that standard), or unraveling into a very long vector (more standard).

3 Likes

I think there is actually a lot of room for improvement in PyMC3’s treatment of time-series, though I have not been using PyMC3 for very long, I may just not be aware of the best way to proceed.

That being said, one the most powerful tools for modelling multivariate time series is the markovian state space model, so it seems natural to want to have this directly available. I’m working on implementing this model in PyMC3 here: https://github.com/RJTK/pymc3-kalman, where I forked an existing Kalman filter implementation (the likelihood comes from running a forward Kalman filtering pass). Unfortunately, this implementation seems to be extremely slow.

I have also had some success with VAR models implemented just with the pm.Normal distribution and some looping, e.g.

x_target, X_lagged = make_lagged_data(X, p)
x_target = x_target[:, 0, :]

T, n, K = X.shape  # Time, nodes, reps

with pm.Model() as model:
    beta = pm.Uniform("beta", -1, 1, shape=(n, p))
    mu = pm.StudentT("mu", nu=3)
    sd = pm.HalfStudentT("sv", nu=3)

    pred = 0.0
    for tau in range(p):
        for i in range(n):
            pred += beta[i, tau] * X_lagged[tau][:, i]
    pm.Normal("ar", mu=mu + pred, sd=sd,
              observed=x_target)

which is set up for one-step-ahead prediction of a target series from a collection of lagged components. This is fast enough with ADVI, but is surely not the best way to go about building bigger time series models.

Here is a relevant and recent paper on Bayesian multivariate time series: http://www.jmlr.org/papers/volume19/18-009/18-009.pdf

5 Likes