Making Out of Sample Predictions with GaussianRandomWalk

The error you’re seeing is caused by the fact that the inferred values of beta_1 stored in the trace have shape (chains, draws, 200). So when you swap in regressor values with shape (71,), beta_1 * regressor fails to broadcast. However, this error isn’t the main problem with what you’re trying to do.
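You can reproduce the broadcasting failure with plain numpy (the chain and draw counts here are just illustrative):

import numpy as np

beta_1_posterior = np.zeros((4, 1000, 200))  # (chains, draws, training points)
new_regressor = np.zeros(71)                 # out of sample regressor values

try:
    beta_1_posterior * new_regressor
except ValueError as err:
    print(err)  # operands could not be broadcast together ...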

A few central assumptions must hold for you to be able to make out of sample predictions by just switching regressor values in and out while leaving the model structure the same (the sketch after this list shows the easy case where they do hold):

  • The new observations must be conditionally independent of the past observations, given the regressor values and model parameters
  • The coefficient values must not change when adding the new regressors
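
For contrast, when both assumptions do hold (say, a plain linear regression with a static coefficient), swapping regressors is exactly what pm.MutableData and pm.set_data are for. A minimal sketch (the variable names are illustrative):

import pymc as pm

with pm.Model() as static_model:
    x = pm.MutableData("x", x_train)
    beta = pm.Normal("beta", 0, 10)  # static coefficient, no random walk
    y = pm.Normal("y", mu=beta * x, sigma=1, observed=y_train, shape=x.shape)
    trace = pm.sample()

with static_model:
    pm.set_data({"x": x_test})
    preds = pm.sample_posterior_predictive(trace, predictions=True)

Neither of these moves is available to you here, because beta_1 is tied to the training time index.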

In your problem, beta_1 is a GaussianRandomWalk. This means that beta_1 changes over time, and its future values depend on the ones that were inferred on the training data. For this reason, making out of sample predictions isn’t as simple as building a new, independent model with new regressor values. As a rule of thumb, this happens every time you have time series distributions in your model. What you need to do instead is forecast the potential future values of beta_1, and use those to predict the potential future values of observed.
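
Concretely, with the default innovation scale of 1, a GaussianRandomWalk follows

beta_t = beta_{t-1} + epsilon_t,    epsilon_t ~ Normal(0, 1)

which means a forecast h steps past the end of training (call it T) is the last inferred value plus a cumulative sum of fresh innovations:

beta_{T+h} = beta_T + epsilon_{T+1} + ... + epsilon_{T+h}

This recursion is exactly what the future_beta_1 variable below will implement.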

There are a few roads you could take. One is to use masked arrays of observations, so that the forecast is inferred at the time you call pm.sample; the downside is that you would need to know the future regressor values before calling pm.sample, which you usually don’t (see the sketch below). The way I’d go about it instead is to add a new random variable that represents the future beta_1 values. The bad part is that at the moment it’s hard to set the initial value of a time series distribution in pymc, so you’ll have to write the random walk down yourself as a cumulative sum of normal innovations, as in the code that follows.
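For reference, the masked array road would look roughly like this (a sketch only; it assumes x_test is already known when you fit):

import numpy as np

# Concatenate the known training outputs with masked placeholders for the future
x_full = np.concatenate([x_train, x_test])
y_full = np.ma.masked_invalid(
    np.concatenate([y_train, np.full(len(x_test), np.nan)])
)
# Passing y_full as `observed=` makes pymc treat the masked entries as
# unobserved variables, so pm.sample imputes them alongside the parameters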

import pymc as pm

# Build the model for training
with pm.Model(coords=coords_train) as model:
    regressor = pm.ConstantData('x_train', x_train, dims='obs_ind')
    # The random walk gets one value per training observation
    beta_1 = pm.GaussianRandomWalk('beta_1', dims='obs_ind')
    beta_0 = pm.Normal('beta_0', mu=0, sigma=10)
    mu = beta_0 + beta_1 * regressor
    output = pm.ConstantData('y_train', y_train, dims='obs_ind')
    sigma = pm.HalfCauchy('sigma', 10)
    observed = pm.Normal('observed', mu=mu, sigma=sigma, observed=output, dims='obs_ind')
    # Infer the posterior
    trace = pm.sample(return_inferencedata=True)
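
Once sampling finishes, you can confirm the shape issue from the top of this answer (the chain and draw counts depend on your sampler settings, so they’re illustrative):

# One beta_1 value per training observation, not per test observation
trace.posterior["beta_1"].shape  # e.g. (4, 1000, 200) = (chain, draw, obs_ind)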

# Add variables to the model to get the forecast and predictions
with model:
    # This way you'll get the sequential index between df_train and df_test
    model.add_coords({"future_obs_ind": df_test.index})
    future_regressor = pm.ConstantData("x_test", x_test, dims="future_obs_ind")
    future_output = pm.ConstantData("y_test", y_test, dims="future_obs_ind")
    # These innovations are not in the trace, so sample_posterior_predictive
    # will draw them from their prior
    future_beta_1_innovations = pm.Normal("future_beta_1_innovations", 0, 1, dims="future_obs_ind")
    # Continue the random walk from its last inferred value
    future_beta_1 = pm.Deterministic(
        "future_beta_1",
        beta_1[..., -1] + future_beta_1_innovations.cumsum(axis=-1),
        dims="future_obs_ind",
    )
    future_observed = pm.Normal(
        "future_observed",
        mu=beta_0 + future_beta_1 * future_regressor,
        sigma=sigma,
        observed=future_output,
        dims="future_obs_ind",
    )
    pp = pm.sample_posterior_predictive(trace)
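
Note that future_beta_1_innovations is not part of the trace, so pm.sample_posterior_predictive samples it from its prior, while beta_0, beta_1 and sigma come from the posterior. If you also want the forecasted coefficient path itself in the output, you can request it by name via the var_names argument (pp_full is just an illustrative name):

with model:
    pp_full = pm.sample_posterior_predictive(
        trace, var_names=["future_beta_1", "future_observed"]
    )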

This will allow you to forecast, but you’ll see that the GaussianRandomWalk is a bad prior for the rate in this dataset:

# Plot the training data in black, thinned posterior predictive forecast
# draws in blue, and the posterior predictive mean in orange
l0, = pp.observed_data.observed.plot(color="k")
pp.posterior_predictive.future_observed.stack(sample=["chain", "draw"]).isel(
    sample=slice(None, None, 100)  # keep every 100th sample to avoid overplotting
).plot.line(x="future_obs_ind", color="C0", alpha=0.1, add_legend=False)
l1, = pp.posterior_predictive.future_observed.mean(["chain", "draw"]).plot.line(
    x="future_obs_ind", color="C1"
)
l1.axes.set(ylim=[-20, 20], xlabel="time index", ylabel="Observed values")
l1.axes.legend([l0, l1], ["Observed", "Expected"])

[Figure: the observed series in black, with thinned posterior predictive forecast draws in blue and the posterior predictive mean in orange]
