Hi,
I’m wondering if there’s an equivalent way in PyMC of doing Rolling OLS as described in the statsmodels package:
Rolling OLS applies OLS across a fixed window of observations and then rolls (moves or slides) the window across the data set. The key parameter is window, which determines the number of observations used in each OLS regression. By default, RollingOLS drops missing values in the window and so will estimate the model using the available data points. Estimated values are aligned so that models estimated using data points i+1, i+2, ..., i+window are stored in location i+window.
Overall, instead of modeling the parameters of a linear model with a Gaussian Random Walk (like here), we fit several linear models on consecutive windows of data of fixed length and keep track of the parameters on each of these windows.
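For reference, the statsmodels version looks roughly like this (a sketch following the rolling regression example in the statsmodels docs; the data loading and column names below are taken from that example and are illustrative only):

import pandas_datareader as pdr
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# Fama-French factor and industry-portfolio data, as in the statsmodels example
factors = pdr.get_data_famafrench('F-F_Research_Data_Factors', start='1926-01')[0]
industries = pdr.get_data_famafrench('10_Industry_Portfolios', start='1926-01')[0]

endog = industries.HiTec - factors.RF.values    # excess return of one industry
exog = sm.add_constant(factors['Mkt-RF'])       # market factor plus an intercept

# One OLS fit per 60-observation window; params has one (const, Mkt-RF) row per window
rols = RollingOLS(endog, exog, window=60)
params = rols.fit().params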
Any help is appreciated.
Sounds like an Autoregressive process? pymc.AR — PyMC 5.0.2 documentation
It’s not necessarily an autoregressive process because we are fitting a model of the form y_W = X_W\beta_W + \epsilon that only covers a window W of the data, and the design matrix does not necessarily contain lags of y. The window is moving and, as it does, a new vector \beta_W is computed and stored.
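Each window’s estimate is just the usual OLS solution on that slice of the data, i.e. \hat{\beta}_W = (X_W^\top X_W)^{-1} X_W^\top y_W, recomputed as the window slides forward.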
Rolling regression basically just runs a bunch of independent regressions and collects the results. You can do that easily in PyMC:
from functools import reduce
import pymc as pm
import arviz as az
# Other imports + data preparation as from the statsmodels example
def make_window_data(series, window_size):
    # Build a DataFrame with one column per window: column i holds observations i, ..., i + window_size - 1
    n = series.shape[0]
    col_name = series.name
    windowed_list = [
        series.reset_index(drop=True).to_frame().iloc[i:window_size + i].rename(columns={col_name: i})
        for i in range(n)
    ]
    windowed_df = reduce(lambda left, right: left.join(right.reset_index(drop=True)), windowed_list)
    return windowed_df
windowed_endog = make_window_data(endog, 60)
windowed_exog = make_window_data(exog['Mkt-RF'], 60)
# Statsmodels drops the ends (because there's missing data) but in principle you don't have to do this
windowed_endog.dropna(axis=1, how='any', inplace=True)
windowed_exog.dropna(axis=1, how='any', inplace=True)
This gives you a (window_size, n - window_size) array of data that you’ll use to do a bunch of regressions. Then just run them all on the same MCMC chain:
coords = {
    'regression': windowed_endog.columns
}

with pm.Model(coords=coords) as mod:
    betas = pm.Normal('beta', dims=['regression'])
    alphas = pm.Normal('alpha', dims=['regression'])
    sigmas = pm.HalfCauchy('sigmas', beta=0.8, dims=['regression'])

    mus = alphas + betas * windowed_exog
    obs = pm.Normal('obs', mu=mus, sigma=sigmas, observed=windowed_endog.values)

    idata = pm.sample()
The “key move” is that I use broadcasting to compute mu: the betas will be broadcast over the columns, so each column of mu corresponds to an independent regression. PyMC will also interpret this as a collection of independent normal distributions for you when you hand it to pm.Normal to compute the likelihood.
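If it helps to see what the broadcasting is doing, here is a toy numpy version with made-up shapes (illustration only, not part of the model):

import numpy as np

window_size, n_regressions = 60, 5
X = np.random.randn(window_size, n_regressions)   # one column of data per regression
alpha = np.random.randn(n_regressions)            # one intercept per regression
beta = np.random.randn(n_regressions)             # one slope per regression

mu = alpha + beta * X   # (n_regressions,) broadcasts across the rows of each column
print(mu.shape)         # (60, 5): column j is the regression line for window j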
Results for the first 25 regressions (with statsmodels estimates as red triangles):
Thanks @jessegrabowski, this is super helpful. I ran the code and it works nicely. I’m trying to replicate the plot that you shared. I managed to plot a subset of the parameters using this code:
az.plot_forest(
    idata.posterior.stack(sample=("chain", "draw")).sel(regression=[0, 1, 2, 3, 4]),
    var_names=["alpha", "beta"]
)
However, I do not know how you added the parameters from the RollingOLS models and reduced the number of chains to a single one. Can you give me a hint on this?
All the arviz plot functions return a list of matplotlib axes objects, so if you save it you can continue plotting on them as usual. I get the y-tick locations using yaxis.get_major_ticks()[i].get_loc(), which, combined with the parameter value itself, lets you do a scatterplot:
import numpy as np

ax = az.plot_forest(idata, var_names=['alpha', 'beta'], coords={'regression': np.arange(25)},
                    combined=True)
# 25 alpha intervals then 25 beta intervals appear on the plot; grab each tick's y-location
y_vals = reversed(list(map(lambda i: ax[0].yaxis.get_major_ticks()[i].get_loc(), np.arange(50))))

# params is the (n_windows, 2) table of RollingOLS estimates from statsmodels
for i, y in enumerate(y_vals):
    data = params.dropna().iloc[i % 25, i // 25]
    ax[0].scatter(data, y, marker='^', zorder=100, color='tab:red')
I see. Awesome! One last question out of curiosity: the current model definition says that alpha and beta are independent, but in theory it is possible to write these as a multivariate normal distribution, i.e.
coords = {
    'regression': windowed_endog.columns,
    'params': ['alpha', 'beta']
}

coefs = pm.MvNormal('coefs', mu=mu, chol=chol, dims=['regression', 'params'])
Overall, this should enable us to provide some more prior information to these parameters. Does this make sense at all even though the columns in the data matrix represent independent regressions?
Yeah you can definitely get creative with the priors. Then the regressions aren’t independent anymore, but this should be seen as an advantage of the Bayesian method vs vanilla OLS (why should they be independent?)
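For example, one way to flesh that sketch out (the LKJ prior on the correlation, the zero prior mean, and the Exponential scale prior below are my own illustrative choices, not something established in this thread):

import numpy as np
import pymc as pm

coords = {
    'regression': windowed_endog.columns,
    'params': ['alpha', 'beta'],
}

with pm.Model(coords=coords) as mod:
    # Shared 2x2 covariance over (alpha, beta), with an LKJ prior on the correlation
    chol, corr, stds = pm.LKJCholeskyCov(
        'chol', n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    coefs = pm.MvNormal('coefs', mu=np.zeros(2), chol=chol, dims=['regression', 'params'])

    alphas = coefs[:, 0]
    betas = coefs[:, 1]
    sigmas = pm.HalfCauchy('sigmas', beta=0.8, dims=['regression'])

    mus = alphas + betas * windowed_exog.values
    obs = pm.Normal('obs', mu=mus, sigma=sigmas, observed=windowed_endog.values)
    idata = pm.sample()

The shared covariance pools information across windows, so the per-window (alpha, beta) pairs are shrunk toward a common correlation structure rather than being estimated in isolation.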
Thanks, Jesse! Really appreciate your help. It has been very insightful.