Making Out of Sample Predictions with GaussianRandomWalk

I am running into the classic shape issue when sampling the posterior predictive distribution on out-of-sample data, as described in this PyMC Discourse video by @lucianopaz. The reproducible example is as follows:

```python
import numpy as np
import pymc as pm
import pandas as pd
import arviz as az
from pydataset import data
import matplotlib.pyplot as plt

# get data
df = data('faithful')

# define factory function
def rolling_regression_model_factory(coords, x, y):
    with pm.Model(coords=coords) as model:
        regressor = pm.Data('x', x, dims='obs_ind')
        beta_1 = pm.GaussianRandomWalk('beta_1', dims='obs_ind')
        beta_0 = pm.Normal('beta_0', mu=0, sigma=10)
        mu = beta_0 + beta_1 * regressor
        output = pm.Data('y', y)
        sigma = pm.HalfCauchy('sigma', 10)
        observed = pm.Normal('observed', mu=mu, sigma=sigma, observed=output, dims='obs_ind')
    return model

# split train and test
df_train = df[0:200]
x_train = df_train.waiting
y_train = df_train.eruptions
coords_train = {'obs_ind': np.arange(len(df_train))}
df_test = df[201:272]
x_test = df_test.waiting
y_test = df_test.eruptions
coords_test = {'obs_ind': np.arange(len(df_test))}

# sample on training data
with rolling_regression_model_factory(coords=coords_train, x=x_train, y=y_train):
    trace = pm.sample()

# sample pp on test data
with rolling_regression_model_factory(coords=coords_test, x=x_test, y=y_test):
    pp = pm.sample_posterior_predictive(trace)
```
```
ValueError: Input dimension mismatch. One other input has shape[0] = 200, but input[2].shape[0] = 71.
Apply node that caused the error: Elemwise{Composite{(i0 + (i1 * i2))}}(InplaceDimShuffle{x}.0, beta_1, x)
Toposort index: 1
Inputs types: [TensorType(float64, (1,)), TensorType(float64, (None,)), TensorType(int32, (None,))]
Inputs shapes: [(1,), (200,), (71,)]
Inputs strides: [(8,), (8,), (4,)]
Inputs values: [array([-4.42644186]), 'not shown', 'not shown']
Outputs clients: [[normal_rv{0, (0, 0), floatX, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F7ECA41D9E0>), TensorConstant{[]}, TensorConstant{11}, Elemwise{Composite{(i0 + (i1 * i2))}}.0, sigma)]]
```

Can you edit your message with triple backticks (```) to format the code correctly? It's very difficult to read like this

Done, thanks for flagging

The error you're seeing is caused by the fact that the inferred values of `beta_1` that are stored in the trace have shape `(chains, draws, 200)`. So when you try to do predictions switching the regressor values to a shape of `(71,)`, `beta_1 * regressor` fails. However, this error isn't the main concern with what you are trying to do.
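
You can confirm the stored shape directly; a quick check, assuming the `trace` from the training run above:

```python
# The random walk was inferred over the 200 training points
print(trace.posterior["beta_1"].shape)  # (chains, draws, 200)
# The test regressor has 71 entries, so beta_1 * regressor cannot broadcast
print(len(x_test))  # 71
```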

A few central assumptions must be met for you to be able to do out-of-sample predictions by just switching regressor values in and out while leaving the model structure the same (a contrasting sketch follows the list):

• The new observations must be conditionally independent from the past regressor values
• The coefficient values must not change when adding the new regressors
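
For contrast, a model that does satisfy both assumptions, such as a plain linear regression with static coefficients, can predict out of sample by just swapping a data container. A minimal sketch of that standard pattern (the variable names here are illustrative):

```python
with pm.Model() as lin_model:
    x_data = pm.MutableData("x_data", x_train)
    beta_0 = pm.Normal("beta_0", mu=0, sigma=10)
    beta_1 = pm.Normal("beta_1", mu=0, sigma=10)  # static coefficient, not a random walk
    sigma = pm.HalfCauchy("sigma", 10)
    pm.Normal(
        "obs",
        mu=beta_0 + beta_1 * x_data,
        sigma=sigma,
        observed=y_train,
        shape=x_data.shape,  # lets the observed node resize with the data container
    )
    lin_trace = pm.sample()

with lin_model:
    pm.set_data({"x_data": x_test})
    lin_pp = pm.sample_posterior_predictive(lin_trace)
```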

In your problem, `beta_1` is a `GaussianRandomWalk`. This means that `beta_1` changes over time, and its future values are conditionally related to the ones that were inferred on the training data. For this reason, making out-of-sample predictions isn't as simple as just building a new independent model with new regressor values. As a rule of thumb, this happens every time you have time series distributions. What you need to do instead is to forecast the potential future values of `beta_1`, and use those to make predictions of the potential future values of `observed`.
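
To make the conditional structure concrete, here is a tiny NumPy illustration (purely schematic, not part of the model): each step of a random walk depends on the previous one, so a forecast must continue from the last inferred value rather than start from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_1_train = np.cumsum(rng.normal(size=200))  # a walk over the training window
future_innovations = rng.normal(size=71)        # one innovation per future step
# The forecast is anchored at the last training value and accumulates innovations
beta_1_future = beta_1_train[-1] + np.cumsum(future_innovations)
```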

There are a few roads you could take to do this. One is to use masked arrays of observations so that the forecast is inferred at the time you call `pm.sample`. The downside with this is that you would need to know the future regressor values before you call `pm.sample`, so you usually can't do this. The way that I'd go about doing this is to create a new random variable that represents the future `beta_1` values. The bad part of this is that at the moment it's hard to set the initial value of a time series distribution in pymc, so you'll have to write down the distribution itself as a cumulative sum of normal innovations:

```python
# Build the model for training
with pm.Model(coords=coords_train) as model:
    regressor = pm.ConstantData('x_train', x_train, dims='obs_ind')
    beta_1 = pm.GaussianRandomWalk('beta_1', dims='obs_ind')
    beta_0 = pm.Normal('beta_0', mu=0, sigma=10)
    mu = beta_0 + beta_1 * regressor
    output = pm.ConstantData('y_train', y_train)
    sigma = pm.HalfCauchy('sigma', 10)
    observed = pm.Normal('observed', mu=mu, sigma=sigma, observed=output, dims='obs_ind')
    # Infer the posterior
    trace = pm.sample(return_inferencedata=True)

# Add variables to the model to get the forecast and predictions
with model:
    # This way you'll get the sequential index between df_train and df_test
    model.add_coords({"future_obs_ind": df_test.index})
    future_regressor = pm.ConstantData("x_test", x_test, dims="future_obs_ind")
    future_output = pm.ConstantData("y_test", y_test, dims="future_obs_ind")
    future_beta_1_innovations = pm.Normal("future_beta_1_innovations", 0, 1, dims="future_obs_ind")
    future_beta_1 = pm.Deterministic(
        "future_beta_1",
        beta_1[..., -1] + future_beta_1_innovations.cumsum(axis=-1),
        dims="future_obs_ind",
    )
    future_observed = pm.Normal(
        "future_observed",
        mu=beta_0 + future_beta_1 * future_regressor,
        sigma=sigma,
        observed=future_output,
        dims="future_obs_ind",
    )
    pp = pm.sample_posterior_predictive(trace)
```
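
Once `pp` is sampled, the forecast lives in the `posterior_predictive` group. A sketch of pulling out a point forecast and an interval with standard ArviZ/xarray calls:

```python
forecast = pp.posterior_predictive["future_observed"]
forecast_mean = forecast.mean(["chain", "draw"])  # point forecast per future step
forecast_hdi = az.hdi(forecast)                   # default 94% HDI per future step
```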

This will allow you to forecast, but you'll see that the `GaussianRandomWalk` is a bad prior for the rate in this dataset:

```python
# Observed series in black
l0, = pp.observed_data.observed.plot(color="k")
# A thinned subset of posterior predictive forecast draws
pp.posterior_predictive.future_observed.stack(sample=["chain", "draw"]).isel(
    sample=slice(None, None, 100)
).plot.line(x="future_obs_ind", color="C0", alpha=0.1, add_legend=False)
# Posterior predictive mean of the forecast
l1, = pp.posterior_predictive.future_observed.mean(["chain", "draw"]).plot.line(
    x="future_obs_ind", color="C1"
)
l1.axes.set(ylim=[-20, 20], xlabel="time index", ylabel="Observed values")
l1.axes.legend([l0, l1], ["Observed", "Expected"])
```


This is amazing, thank you so much!!

Why is it hard?

Is there a way to set the initial point exactly equal to a certain value? As far as I understand, you need to feed in a `RandomVariable`. Maybe you could make it work with a `DiracDelta`?

I'm thinking out loud here, but wouldn't it be nice to have a `conditional` method, similar to what we have with GPs, but for time series? I imagine that you would be able to do something like

```python
series = pm.SomeTimeSeries(...)
forecast = pm.SomeTimeSeries.conditional("a_name?", conditioned_on=series, ...)
```

We could even make `conditional` `ignore_logprob` on the generated random variables. That would allow you to:

• Build a `forecast` node at inference time
• Change the forecast interval using data containers
• Ignore `forecast` during inference so that our steppers don't mess with those variables

I see what you mean now. We don't have any sampler that can handle `DiracDelta`, but you can just make the initial point be like any other innovation (i.e., a normal centered around the last observed value and with the same mu/sigma as the rest of the distribution):

```python
next_grw = pm.GaussianRandomWalk("next_grw", init_dist=pm.Normal.dist(mu=prev_grw[-1] + mu, sigma=sigma), mu=mu, sigma=sigma)
```
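
Applied to the example earlier in the thread, that would replace the hand-rolled cumulative sum. A sketch, assuming the training `model` and `beta_1` from above and the training walk's default `mu=0`, `sigma=1`:

```python
with model:
    model.add_coords({"future_obs_ind": df_test.index})
    # A second random walk over the forecast window, started from the last
    # fitted value of beta_1 instead of a manual cumsum of innovations
    future_beta_1 = pm.GaussianRandomWalk(
        "future_beta_1",
        init_dist=pm.Normal.dist(mu=beta_1[..., -1], sigma=1.0),
        mu=0.0,
        sigma=1.0,
        dims="future_obs_ind",
    )
```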

There is nothing special about the `init_dist`; it's just there to allow a more flexible distribution for the first point in the time series. In this case it doesn't need special treatment.
