Making Out of Sample Predictions with GaussianRandomWalk

I am running into the classical shape issue when sampling the posterior predictive distribution on out-of-sample data, as described in this pymc discourse video by @lucianopaz . The reproducible example is as follows:

import numpy as np
import pymc as pm
import pandas as pd
import arviz as az
from pydataset import data
import matplotlib as plt


#get data
df = data('faithful')

#define factory function
def rolling_regression_model_factory(coords,x,y):
    with pm.Model(coords=coords) as model:
        regressor = pm.Data('x',x,dims='obs_ind')
        beta_1 = pm.GaussianRandomWalk('beta_1',dims='obs_ind')
        beta_0 = pm.Normal('beta_0',mu=0,sigma=10)
        mu = beta_0 + beta_1 * regressor
        output = pm.Data('y',y)
        sigma = pm.HalfCauchy('sigma',10)
        observed = pm.Normal('observed',mu=mu,sigma=sigma,observed=output,dims='obs_ind')
    return model

#split train and test
df_train = df[0:200]
x_train = df_train.waiting
y_train = df_train.eruptions
coords_train = {'obs_ind':np.arange(len(df_train))}
df_test = df[201:272]
x_test = df_test.waiting
y_test = df_test.eruptions
coords_test = {'obs_ind':np.arange(len(df_test))}

#sample on training data
with rolling_regression_model_factory(coords=coords_train,x=x_train,y=y_train):
    trace = pm.sample()
    
#sample pp on test data
with rolling_regression_model_factory(coords=coords_test,x=x_test,y=y_test):
    pp = pm.sample_posterior_predictive(trace)
ValueError: Input dimension mismatch. One other input has shape[0] = 200, but input[2].shape[0] = 71.
Apply node that caused the error: Elemwise{Composite{(i0 + (i1 * i2))}}(InplaceDimShuffle{x}.0, beta_1, x)
Toposort index: 1
Inputs types: [TensorType(float64, (1,)), TensorType(float64, (None,)), TensorType(int32, (None,))]
Inputs shapes: [(1,), (200,), (71,)]
Inputs strides: [(8,), (8,), (4,)]
Inputs values: [array([-4.42644186]), 'not shown', 'not shown']
Outputs clients: [[normal_rv{0, (0, 0), floatX, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F7ECA41D9E0>), TensorConstant{[]}, TensorConstant{11}, Elemwise{Composite{(i0 + (i1 * i2))}}.0, sigma)]]

Can you edit you message with triple backticks ` to format the code correctly? It’s very difficult to read like this

Done, thanks for flagging

The error you’re seeing is caused by the fact that the inferred values of beta_1 that are stored in the trace have shape (chains, draws, 200). So when you try to do predictions switching the regressor values to a shape of (71,), beta_1 * regressor fails. However, this error isn’t the main concern with what you are trying to do.

A few central assumptions must be met for you to be able to do out of sample predictions by just switching in and out regressor values, but leaving the model structure the same:

  • The new observations must be conditionally independent from the past regressor values
  • The coefficient values must not change when adding the new regressors

In your problem, beta_1 is a GaussianRandomWalk. This means that beta_1 changes over time, and its future values are conditionally related to the ones that were inferred on the training data. For this reason, making out of sample predictions isn’t as simple as just building a new independent model with new regressor values. As a rule of thumb, this happens every time that you have time series distributions. What you need to do instead is to forecast the potential future values of beta_1, and use those to make predictions of the potential future values of observed.

There are a few roads you could take to do this. One is to use masked arrays of observations so that the forecast is inferred at the time you call pm.sample. The downside with this is that you would need to know the future regressor values before you call pm.sample, so you usually can’t do this. The way that I’d go about doing this is to create a new random variable that represents the future beta_1 values. The bad part of this is that at the moment it’s hard to set the initial value of a time series distribution in pymc, so you’ll have to write down the distribution itself as a cumulative sum of normal innovations:

# Build the model for training
with pm.Model(coords=coords_train) as model:
    regressor = pm.ConstantData('x_train',x_train,dims='obs_ind')
    beta_1 = pm.GaussianRandomWalk('beta_1',dims='obs_ind')
    beta_0 = pm.Normal('beta_0',mu=0,sigma=10)
    mu = beta_0 + beta_1 * regressor
    output = pm.ConstantData('y_train', y_train)
    sigma = pm.HalfCauchy('sigma',10)
    observed = pm.Normal('observed',mu=mu,sigma=sigma,observed=output,dims='obs_ind')
    # Infere the posterior
    trace = pm.sample(return_inferencedata=True)

# Add variables to the model to get the forecast and predictions
with model:
    model.add_coords({"future_obs_ind": df_test.index}) # This way you'll get the sequential index between df_train and df_test
    future_regressor = pm.ConstantData("x_test", x_test, dims="future_obs_ind")
    future_output = pm.ConstantData("y_test", x_test, dims="future_obs_ind")
    future_beta_1_innovations = pm.Normal("future_beta_1_innovations", 0, 1, dims="future_obs_ind")
    future_beta_1 = pm.Deterministic(
        "future_beta_1",
        beta_1[..., -1] + future_beta_1_innovations.cumsum(axis=-1),
        dims="future_obs_ind",
    )
    future_observed = pm.Normal(
        "future_observed",
        mu=beta_0 + future_beta_1 * future_regressor,
        sigma=sigma,
        observed=future_output,
        dims="future_obs_ind",
    )
    pp = pm.sample_posterior_predictive(trace)

This will allow you to forecast, but you’ll see that the GaussianRandomWalk is a bad prior for the rate in this dataset:

l0, = pp.observed_data.observed.plot(color="k")
pp.posterior_predictive.future_observed.stack(sample=["chain", "draw"]).isel(
    sample=slice(None, None, 100)
).plot.line(x="future_obs_ind", color="C0", alpha=0.1, add_legend=False)
l1, = pp.posterior_predictive.future_observed.mean(["chain", "draw"]).plot.line(
    x="future_obs_ind", color="C1"
)
l1.axes.set(ylim=[-20, 20], xlabel="time index", ylabel="Observed values")
l1.axes.legend([l0, l1], ["Observed", "Expected"])

image

3 Likes

This is amazing, thank you so much!!

Why is it hard?

Is there a way to set the initial point exactly equal to a certain value? As far as I understand, you need to feed in a RandomVariable. Maybe you could make it work with a DiracDelta?

I’m thinking out loud here, but wouldn’t it be nice to have a conditional method, similar to what we have with GPs, but for time series? I imagine that you would be able to do something like


    series = pm.SomeTimeSeries(...)
    forecast = pm.SomeTimeSeries.conditional("a_name?", conditioned_on=series, ...)

We could even make conditional ignore_logprob on the generated random variables. That would allow you to

  • Build a forecast node at inference time
  • Change the forecast interval using data containers
  • Ignore forecast during inference so that our steppers don’t mess up with those

I see what you mean now. We don’t have any sampler that can handle Diracdelta, but you can just make the initial point be like any other innovation (ie a normal centered around the last observed value and with the same mu/sigma as the rest of the distribution)

next_grw = pm.GaussianRandomWalk("next_grw", init_dist=pm.Normal(mu=prev_grw[-1]+mu, sigma=sigma), mu=mu, sigma=sigma)

There is nothing special about the initial_dist, it’s just there to allow a more flexible distribution for the first point in the time-series. In this case it doesn’t need special treatment.

2 Likes

I just encountered this problem, so thanks @lucianopaz and @ricardoV94 for the great explanations, and @zlegge for the great question!

FWIW I wanted an example that combined everything, and added explicit drift, so created a gist here: Example model for holdout (future) set PPC using a GaussianRandomWalk in pymc3 · GitHub - possibly it helps the next reader :slight_smile:

I’m not sure if the poor performanace is due to the nature of the data, or using the wrong likelihood (tried Poisson and Lognormal but they were dreadful), or the model is missing e.g. @junpenglao’s smoothener / identifier trick