How to Create Multivariate AR(1)

I am attempting to create a time-dependent model built on an AR(1). The idea is to model the number of three-pointers a team makes as a Gaussian whose mean follows an AR(1) process. Ideally I’d like to expand this to all 30 NBA teams. If I model just one team, it appears to work quite well.

with pm.Model() as warr_model:
       warr_tau = pm.HalfNormal('warr_tau',sigma = 10)

       warr_coef = pm.Normal('warr_coef',mu = 1, sigma = 20)

       warr_ar1 = pm.AR1('warr_ar1',k = warr_coef, tau_e = warr_tau, shape = len(warr_fg3m))

       game_sd = pm.HalfNormal("game_sd", sigma = 20)

       y = pm.Normal('y',mu = warr_ar1,sigma = game_sd, observed = warr_fg3m)

I get normal results, as seen right here

If I do something similar with New Orleans I also get normal results:

However, when I try to combine these two into a multivariate AR1 I get wacky results. Note that in the code below, the results for New Orleans and the Warriors are independent.

with pm.Model() as warr_nola:
       # This is a combined model where we are doing analysis on both NOLA and the Warriors
       comb_tau = pm.HalfNormal('comb_tau', sigma = 10, shape = 2)

       comb_coef = pm.Normal('comb_coef', mu = 1, sigma = 20, shape = 2)

       comb_ar1 = pm.AR1('comb_ar1', k = comb_coef, tau_e = comb_tau, shape = (40,2))

       comb_sd = pm.HalfNormal("comb_sd", sigma = 20, shape = 2)

       y_comb = pm.Normal('y_comb', mu = comb_ar1, sigma = comb_sd, observed = np.stack([nola_fg3m, warr_fg3m], axis = 1))

If I sample from this distribution, I get pretty wacky results that are significantly different:

Can someone please explain what’s going wrong? Also, the weirdness is coming from the fact that my k coefficient goes from ~0.98 to ~0.02 when I change from the univariate to the multivariate case.

Just looking into similar things myself, with PyMC v5. I believe a p-sized array of AR coefficients is interpreted as an AR(p) process. In any case, why not just make several univariate distributions (and, if needed, stack them later)?
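For example, here is a rough sketch of that "stack them later" idea in PyMC v5 (the fake data and all the names are mine, purely for illustration; note the coefficient is given shape 1 so it is read as a single lag rather than several lags of one process):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

# Fake stand-ins for the two observed series (length 40 each), purely illustrative
rng = np.random.default_rng(0)
warr_fg3m = rng.poisson(14, size=40).astype(float)
nola_fg3m = rng.poisson(11, size=40).astype(float)

with pm.Model() as stacked_model:
    ars = []
    for name, obs in [("warr", warr_fg3m), ("nola", nola_fg3m)]:
        coef = pm.Normal(f"{name}_coef", mu=0, sigma=1, shape=1)  # shape 1 -> one lag -> AR(1)
        sigma = pm.HalfNormal(f"{name}_sigma", sigma=10)
        ar = pm.AR(f"{name}_ar1", rho=coef, sigma=sigma, ar_order=1,
                   init_dist=pm.Normal.dist(0, 10), shape=len(obs))
        ars.append(ar)

    game_sd = pm.HalfNormal("game_sd", sigma=10, shape=2)
    y = pm.Normal("y", mu=pt.stack(ars, axis=1), sigma=game_sd,
                  observed=np.stack([warr_fg3m, nola_fg3m], axis=1))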

Which version of PyMC are you using? The following advice applies to PyMC>4.0

The AR should have shape (2, 40) and the coefficients shape (2, 1) so that everything is aligned.

import pymc as pm

x = pm.AR.dist(
  init_dist=pm.DiracDelta.dist(0.0), 
  rho=[[0], [1.2]], 
  sigma=1,
  ar_order=1,
  shape=(2, 40),
)
pm.draw(x, random_seed=1).mean(1)
# array([ -0.35073367, 252.40673385])

There are good reasons to use batched variables instead of stacking individual ones, as it results in smaller and more vectorized/performant models. This only works if the timeseries have the same length of course.


Hi @ricardoV94, nice to read you. I just updated to v5 and I’ll try that right away. May I ask a trickier question? What do you think is the best way to generate a multivariate AR(1) with correlation between the components? I am dealing with time series of tide-gauge data which individually display an auto-correlated pattern, and are also correlated with each other. What I did so far was to generate a number (for now 10, but that could rise to hundreds) of AR(1) time series, and then combine them to obtain correlated noise via the Cholesky decomposition of the covariance matrix: correlated_timeseries = at.dot(chol, uncorrelated_timeseries). That worked nicely, but then the individual time series are not strictly AR(1) any more (each becomes a linear combination of AR(1) series). That may not be a big deal, but it’s not very transparent. If I were to do this by hand, I would first generate spatially correlated noise and use it as the innovation term, but I suppose that would require looping by hand, or using scan, which is not efficient.
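To make that concrete, here is a stripped-down sketch of the approach I describe (made-up sizes and PyMC v5 API; my real model adds the differencing and variance scaling discussed further down):

import pymc as pm
import pytensor.tensor as pt

n_series, n_steps = 10, 200

with pm.Model() as corr_ar_model:
    rho = pm.Uniform("rho", 0, 1, shape=n_series)

    # n_series independent AR(1) processes with unit innovations, one per row
    ar = pm.AR("ar", rho=rho[:, None], sigma=1.0, ar_order=1,
               init_dist=pm.Normal.dist(0, 1), shape=(n_series, n_steps))

    # Cholesky factor of the cross-series covariance
    chol, _, _ = pm.LKJCholeskyCov("chol", n=n_series, eta=1,
                                   sd_dist=pm.HalfNormal.dist(1.0))

    # Mix the independent rows to introduce cross-correlation between series
    correlated = pm.Deterministic("correlated", pt.dot(chol, ar))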

I am also interested in other available time-series models, since AR(1) is only one idea. For instance, I guess some kind of Fourier series containing short- and longer-term frequencies could also work. But that should be the topic of another question. In any case, in my first attempt with AR(1), I was very pleased to see how well and how relatively quickly PyMC’s NUTS sampler fit the data.

No distribution usually keeps its original form after conditioning on the likelihood, so I don’t think it’s a problem that they are not AR(1) anymore. I would think about the generative process, and then it may or may not make sense to put AR(1) priors on the latent timeseries.

We have a notebook on Vector Autoregressive Models which (more?) explicitly accounts for correlations between the timeseries. It may be somewhat equivalent to your specification.

@jessegrabowski may be able to give you much more useful advice (and correct me if I am wrong) :slight_smile:


Thanks much @ricardoV94 for the hint. Just from reading the abstract, I feel like I’m at the entrance of a gold mine: it sounds exactly like what I am after.


VAR definitely sounds like the way to go : )

Just be aware that as the dimensionality of the coefficient matrix grows (either because you include more time series or more lags), you’re going to need to use tight priors to avoid divergences.
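To give an idea of what “tight” means here, a toy sketch of my own (not code from the notebook): a zero-centred prior with a small standard deviation on the whole lag-coefficient array.

import pymc as pm

n_series, n_lags = 6, 2

with pm.Model() as var_priors:
    # Tight, zero-centred prior on all lag coefficients; as n_series or n_lags
    # grows, wide priors here make explosive draws and divergences much more likely.
    lag_coefs = pm.Normal("lag_coefs", mu=0.0, sigma=0.1,
                          shape=(n_lags, n_series, n_series))
    intercept = pm.Normal("intercept", mu=0.0, sigma=1.0, shape=n_series)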

Also, I just noticed there’s a typo in the VAR notebook: ars.append(ar) in the calc_ar_step function should be indented one more level. The shape of betaX in the models should be Time x equations, but it’s only T.


Agh!! Good catch. Will fix this: Error - incorrect indentation in Bayesian VAR notebook · Issue #523 · pymc-devs/pymc-examples · GitHub


Hi, thanks for your hints @jessegrabowski.
So far I have not pursued that route because of the way VAR is fitted in the notebook.
The predictions are made for one time step only (probably for better convergence properties), but that is not possible in my case: I have gappy tide-gauge records that only partially overlap, and never all at once. That means I have to make a forward prediction from time 0 all the way through the time series (though in theory it could be possible to use observations instead of the previous time-step prediction whenever they are available). That means using scan, which I am not good at… (I might be doing something wrong, but I failed even to reproduce the AR(1) class with scan – it does not even start sampling). In contrast, my current approach, though less transparent, seems to work quite well, and uses half the parameters (because of the symmetry of the correlation matrix, reflected in the triangular structure of the LKJ Cholesky factor). Here is what I do:

  ns = 14   # number of locations, also reflected in the `station` dimension
  ny = 200  # number of years for the AR(1) process

  # Parameters for the AR(1)
  rho = pm.Uniform("oceandyn_ar1_rho", dims='station')
  # ... I calculated sigma below so that the DIFF of the AR(1) process has variance = 1
  sigma = at.sqrt((1+rho)/2)  

  # Here 14 individual AR(1) processes are generated (with 5 years throw-away warmup)
  # and the diff is taken (I work on the rate, and the cumulative part is AR(1))
  warmup = 5
  ar1_slr = pm.AR(f"oceandyn_ar1_slr", rho=rho[None].T, sigma=sigma, ar_order=1, steps=ny+warmup,
   shape=(ns, ny+warmup+1), init_dist=pm.Normal.dist(0, 1)) # size years + 1
  uncorrelated_timeseries = at.diff(ar1_slr[:, warmup:], axis=1).T  # years x stations

  # Now I use the Cholesky matrix to introduce correlation
  chol, _, _ = pm.LKJCholeskyCov("oceandyn_chol", n=ns, eta=1, sd_dist=pm.HalfNormal.dist(25, size=ns))
  correlated_timeseries = at.dot(uncorrelated_timeseries, chol.T)  # picks up the variance of chol

  # And later, the likelihood function ..
  sat_rate = ...  # observations
  sat_sigma = pm.HalfNormal("satellite_error", 5, dims="station")
  pm.Normal(f"satellite_obs", mu=correlated_timeseries[1994-1900:2019-1900+1], sigma=sat_sigma, observed=sat_rate, dims=("sat_year", "station"))

Somewhat surprisingly, PyMC is able to derive the posterior of such a large model, solving in time and space at once. It is quite convenient not to have to split the code into separate tuning and prediction steps. However, I am still encountering convergence issues (which I think are due to other aspects), so the discussion is not closed yet.

If you post your scan code I can take a look. I’m not sure I really understand your model though, nor what you mean by “one time-step only”. Does each series record at a different time frequency? You really will need to align the time windows in any case. Under the hood, the AR distribution does essentially the same vectorization “trick” as is used in the VAR notebook.

AR(1) with scan looks like this:

import numpy as np
import pymc as pm
import pytensor

data = np.random.normal(size=100)  # placeholder for your observed 1-D series

with pm.Model() as mod:
    def step(x, rho):
        return rho * x

    rho = pm.Normal('rho')
    ar_mean, _ = pytensor.scan(step, non_sequences=[rho], sequences=dict(input=data, taps=[-1]))
    sigma = pm.Exponential('sigma', 1)

    # Drop first observation from the data (because I didn't model x0)
    obs = pm.Normal('obs', mu=ar_mean, sigma=sigma, observed=data[1:])
    idata = pm.sample(init='jitter+adapt_diag_grad')

Hi @jessegrabowski, thanks for offering to help. I’ll give it another go then, and share the scan function.

What I was saying:

  • all tide-gauge records have the same frequency (I’m using annual values) but cover different time intervals (1900-2019 for the longest one, another from 1980 to 1990, another from 2000 to 2019, etc.), so in any given year, only a subset of stations provides obs data.
  • one time-step only: in the VAR notebook, the workflow is split into a fitting phase and a prediction phase. In the fitting phase, there are actually n one-step-ahead predictions (taking observations at time t, making a prediction for time t+1, and then comparing that prediction with the obs at time t+1).
  • the model: I use the fact that if U of shape (t, n) describes n uncorrelated time series of length t and variance 1, its covariance matrix is the identity, i.e. <U'@U> = n I(n). So for any Cholesky factor L of shape (n, n), the set of time series defined by V = U @ L' has covariance matrix 1/n <V'@V> = L @ (1/n <U'@U>) @ L' = L@L' = C (see the small numpy check below). This is valid for any set of uncorrelated time series of variance 1. Here I derived U from a set of differenced AR(1) processes, and I worked out the math to define the standard deviation of each AR(1) process as a function of rho so that in the end the time series all have variance 1.
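As a sanity check on that claim, here is a small standalone numpy experiment (white noise stands in for my differenced AR(1) series; any uncorrelated, unit-variance columns would do):

import numpy as np

rng = np.random.default_rng(0)
t, n = 200_000, 3

# Target covariance C and its lower Cholesky factor L, so that C = L @ L.T
A = rng.normal(size=(n, n))
C = A @ A.T + n * np.eye(n)
L = np.linalg.cholesky(C)

# U: t x n uncorrelated, unit-variance series; V = U @ L' then has covariance ~ C
U = rng.normal(size=(t, n))
V = U @ L.T

print(np.round(np.cov(V.T), 2))
print(np.round(C, 2))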

Hi @jessegrabowski ,

I tried to make a minimal example of the project I’m working on.
I created two notebooks that make use of tide-gauge data downloaded from the Permanent Service for Mean Sea Level (PSMSL). The aim is to fit a linear trend to the tide gauges with an AR(1) residual (the trend represents the combination of sea-level rise and vertical land motion, which we do not try to disentangle here). That exercise was pretty useful in itself, but if you find the time to give it some thought, I’d be most grateful! Let me explain here (the notebooks have little or no commentary).

First, regarding what I was saying in previous posts: I was actually attempting to model the noise explicitly, which is vastly more resource-intensive and has convergence issues, but presents a few advantages (mainly (a) it’s easier to build a meaningful likelihood function, as the covariances are modelled directly, and (b) missing data are handled seamlessly). Now I’ll try to focus on modelling the noise implicitly, as is done in the VAR example. The main issue I encounter is missing data in the multivariate case. So in the process, the question shifted from specifically multivariate AR(1) to how to handle missing data in multiple dimensions. Maybe that should be another question focused on missing data, but for the sake of the discussion above, and to keep the focus on auto-correlated noise, I’ll stay here for now. I sum up the experiments in the notebook’s header. I find that:

  1. In the univariate case, missing data are best handled by just removing the NaN entries, or by using a masked array to impute the missing data (I’m not sure what’s happening in there, though; I’m using it as a black box). Setting missing data to zero (in both the predicted value and the obs) seems to introduce biases (only in the noise variance in the univariate example, but also in the slope in the multivariate case). I also compare with explicitly modelling the noise, which is further off. Here are the results:

  2. In the second notebook (further down), I solve for 14 tide gauges simultaneously. I use a simplified VAR model where cross-correlation comes only from the noise (there are no cross-coefficients for the auto-correlation). The summary figure below compares three settings:
  • uncorrelated noise with dropped nans (“uncorr drop”)
  • uncorrelated noise with zeroed nans (“uncorr zeros”)
  • correlated noise with zeroed nans (“corr”)

That seems to work alright, but I am worried about missing data. There are at times small differences between dropped and zeroed NaNs in the uncorrelated case, and I worry that might impede a good estimate of the correlation between stations. Unfortunately, the masked-array trick does not work in the multivariate case. I did attempt an equivalent of the “drop” case by constructing the full covariance matrix as cov = lkjcorr * stds * stds[None] and then using yearly observations MvNormal(..., mu=prediction[year, valid], cov=cov[valid, valid], observed=obs[year, valid]), but PyMC does not seem to handle it (it compiles forever; not included in the toy notebook). The explicit noise modelling also took forever to run, so I gave up here (in my real application it actually works in a reasonable time, 15 min per 1000 draws per chain, albeit with a few divergences, ~10 per 1000 samples).
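(For reference, the univariate “masked array trick” mentioned above is just passing a masked array as observed, which makes PyMC create imputation variables for the masked entries; a minimal sketch with made-up numbers, unrelated to the tide-gauge model:)

import numpy as np
import pymc as pm

# Toy series with gaps; the masked entries become free variables that PyMC imputes
y = np.array([1.0, 1.2, np.nan, 1.5, np.nan, 1.9])
y_masked = np.ma.masked_invalid(y)

with pm.Model() as impute_model:
    mu = pm.Normal("mu", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=y_masked)
    idata = pm.sample()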

That is it. I am not sure how easy or difficult it is to follow all this, or whether it is of any interest to you in terms of generality / applicability to other users. If I can make the examples more useful, please let me know.

I saw here that @fonnesbeck is knowledgeable about missing data. Maybe there is a way to handle the multivariate case as well?

Any hint would be much appreciated. Thanks.

EDIT: here is the multivariate model in full:


import numpy as np
import pymc as pm
import pytensor.tensor as pt

# get the data
# ----------------
import urllib.request
import io
import pandas as pd

def fetch_psmsl(id):
    url = f"https://psmsl.org/data/obtaining/rlr.annual.data/{id}.rlrdata"
    with urllib.request.urlopen(url) as response:
       urlData = response.read()    
    df = pd.read_csv(io.StringIO(urlData.decode()), sep=';', header=None, na_values=-99999)
    df.columns = ['year', 'value', 'flag', 'missing days']
    df['value'] -= 7000
    s = df.set_index('year')['value']
    s.name = id
    return s


stations = {'VENEZIA (S.STEFANO)': 39,
 'VENEZIA (ARSENALE)': 87,
 'VENEZIA (PUNTA DELLA SALUTE)': 168,
 'VENEZIA II': 2100,
 'TRIESTE': 154,
 'KOPER': 1009,
 'LUKA KOPER': 1817,
 'TRIESTE II': 2099,
 'SPLIT - GRADSKA LUKA': 352,
 'SPLIT RT MARJANA': 685,
 'BAKAR': 353,
 'DUBROVNIK': 760,
 'BAR': 1075,
 'ZADAR': 1859}

data = pd.DataFrame({id:fetch_psmsl(id) for id in stations.values()})

df = data
years = df.index.values
nt, n = df.shape

with pm.Model() as model:
    
    model.add_coord('year', years)
    model.add_coord('year_predicted', years[1:])
    model.add_coord('station', length=n)
    
    rho = pm.Normal('rho', dims='station')
    # sigma = pm.HalfNormal('sigma', 1, dims='station')
    sigma = pm.HalfNormal.dist(1, size=n)
    chol, _, _ = pm.LKJCholeskyCov('lkj', eta=1, sd_dist=sigma, n=n)
    intercept = pm.Normal('intercept', 0, 10, dims='station')
    slope = pm.Normal('slope', 0, 100, dims='station')
    
    trend = intercept[None] + slope[None]*(years[:, None]-1950)
    pm.Deterministic(f'trend', trend, dims=('year', 'station'))

    # Factor out missing values by setting predicted = observed = 0
    missing = np.isnan(df.values)
    obs = df.fillna(0).values
    
    ano = obs - trend
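    # AR(1) one-step-ahead mean: prediction for year t+1 is trend(t+1) + rho * (obs(t) - trend(t))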
    ar_mean = ano*rho[None]
    pred = trend[1:] + ar_mean[:-1]
    
    mask_pred = missing[:-1]
    aligned_obs = obs[1:].copy()
    mask_obs = missing[1:]
    
    mask = mask_pred | mask_obs
    
    for j in range(n):
        i, = mask[:, j].nonzero()
        pred = pt.set_subtensor(pred[i, j], 0)
        aligned_obs[i, j] = 0
        
    # masking out missing data does not work in multiple dimensions
    # observed = np.ma.array(aligned_obs, mask=mask)
    observed = aligned_obs
    pm.Deterministic(f'trend_plus_armean', pred, dims=('year_predicted', 'station'))
    pm.ConstantData(f'obs', aligned_obs, dims=('year_predicted', 'station'))
    pm.Deterministic(f'ar_mean', ar_mean[:-1], dims=('year_predicted', 'station'))
    pm.MvNormal(f'tidegauge_obs', mu=pred, chol=chol, observed=observed, dims=('year_predicted', 'station'))

    idata = pm.sample()