Out-of-sample prediction

I’m trying to make an out-of-sample prediction using a model_factory setup, but I’m getting a shape error with pymc5; the same code works on pymc3. The error is due to the size difference of the “seasonality” random variable, which is assumed to have the same size as the training dataset. The prediction only works when I set the forecast period to 60 months (the length of the training data).

def model_factory(data, n_order):

    store_idx, stores = data.store.factorize(sort=True)
    item_idx, items = data.item.factorize(sort=True)
    date_idx, date = data.date.factorize(sort=True)

    t = date_to_timeindex(date)
   
    fourier_features = pd.DataFrame(
        {
            f"{func}_order_{order}": getattr(np, func)(2 * np.pi * t * order)
            for order in range(1, n_order + 1)
            for func in ("sin", "cos")
        }
    )
    print(fourier_features.shape)
    COORDS = {
        "obs": data.index,
        "store": stores,
        "item": items,
        "date": date,
        "fourier_features": np.arange(2 * n_order),
    }

    with pm.Model(coords=COORDS) as model:
        _t = pm.MutableData("t", t, dims="date")
        _date_idx = pm.MutableData("date_idx", date_idx, dims="obs")
        _store_idx = pm.MutableData("store_idx", store_idx, dims="obs")
        _item_idx = pm.MutableData("item_idx", item_idx, dims="obs")
        sales = pm.MutableData("sales", data.sales, dims="obs")

        baseline = pm.Normal(
            "baseline", mu=np.log(1000), sigma=0.9, dims=["store", "item"]
        )
        trend = pm.Normal("trend", mu=0, sigma=0.05)

        fourier_coefs = pm.Normal(
            "fourier_coefs", mu=0, sigma=0.1, dims="fourier_features"
        )
        seasonality = pm.Deterministic(
            "seasonality",
            pm.math.dot(fourier_coefs, fourier_features.to_numpy().T),
            dims="date",
        )
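        # ... (the likelihood and a final `return model` appear to be
        # truncated from the original post, as noted later in the thread)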
def forecast_demand(idata,
                    forecasted_dates=pd.date_range("2018-01", "2019-01", freq="M"),):
    
    n_order = len(idata.posterior.coords["fourier_features"]) // 2
    print(n_order)
    store = idata.posterior.coords["store"].values
    item = idata.posterior.coords["item"].values
    forecast_data_index = pd.DataFrame(
        data=0,
        index=pd.MultiIndex.from_product(
            [forecasted_dates, store, item], names=["date", "store", "item"],
        ),
        columns=["sales"],
    )
    print(forecasted_dates.shape)
    forecast_data = forecast_data_index.reset_index()
    print(forecast_data.shape)
    print(forecast_data.head())
    
    with model_factory(data=forecast_data, n_order=n_order) as forecast_model:
        ppc = pm.sample_posterior_predictive(idata)
           
    return ppc, forecast_data
 241     raise TypeError(
    242         "The numpy.ndarray object is not aligned."
    243         " PyTensor C code does not support that.",
    244     )
    246 if not all(
    247     ds == ts if ts is not None else True
    248     for ds, ts in zip(data.shape, self.shape)
    249 ):
--> 250     raise TypeError(
    251         f"The type's shape ({self.shape}) is not compatible with the data's ({data.shape})"
    252     )
    254 if self.filter_checks_isfinite and not np.all(np.isfinite(data)):
    255     raise ValueError("Non-finite elements not allowed")

TypeError: ("The type's shape ((12,)) is not compatible with the data's ((60,))", 'Container name "seasonality"')

Thanks for posting the code. Can you post the model graphviz output for pymc3 and pymc v5? That’ll be one easy way to visualize what is different about the models.

I’m using the same model code. I think the issue might be due to the Data containers or how the seasonality is specified with Deterministic.

pymc3:

seasonality.tag.test_value.shape is (12,)

pymc5:

seasonality.shape is TensorConstant{(1,) of 12}

Got it. It’s just out-of-sample predictions then? Sampling works?

If so, you may need to use sample dims, I believe. Do you have example data for both inference, and then out-of-sample data for posterior predictions?

Yes, the model samples. The issue is only with out-of-sample prediction, and I only see it in pymc5; the same code works in pymc3.

The train df runs from 2013 through the end of 2017, so I have 60 months.

And I’m trying to get predictions for the 12 months of 2018.

Is there an end-to-end example of how this needs to be implemented? The example on the PyMC website is too simple and does not cover many of the more complicated use cases. (Out-Of-Sample Predictions — PyMC example gallery)

I had an “end-to-end example” from v3 and have updated it to v5 now. But as with your code, it seems to crash for “shape” reasons with pymc5/pytensor.

The issue occurs below where it says “n_predictions” / “CRITICAL”: if one uses set_data with something that has a different shape, one gets a shape mismatch.


import numpy as np
import pandas as pd
import pymc as pm
import scipy.stats as stats
import pytensor as th
import pytensor.tensor as tt
import matplotlib.pyplot as plt


### simulated data
print ('#### Simulation ####')
n_observations = 9999
n_observables = 2

mean = [-1.5, 2.2]
cov = [[0.8, -0.6], [-0.6, 1.4]]
obs0, obs1 = np.random.multivariate_normal(mean, cov, n_observations).T

print ('obs corr:', stats.pearsonr(obs0, obs1))


# combining the data
data = pd.DataFrame.from_dict({ \
                                  'index': np.arange(n_observations) \
                                , 'obs0': obs0 \
                                , 'obs1': obs1 \
                              }).set_index('index', inplace = False)

### Model Construction and sampling
print ('#### Modeling ####')
# shared theano objects (pm.Data)
shared = {}

with pm.Model() as model:

    # using pm.Data; yet the error also occurs when using th.shared directly
    shared['intercept'] = pm.Data('intercept_data', np.ones((n_observations, 1)), mutable = True)

    ## intercept
    intercept = pm.Normal( 'intercept' \
                         , mu = 0., sigma = 1. \
                         , shape = (1, n_observables) \
                        )
    estimator = pm.Deterministic('i2', tt.dot(shared['intercept'], intercept))


    ## observable
    if False: # <- this is the "KILLSWITCH"
        # individual observables
        # with this, prediction WORKS

        # residual
        residual = pm.HalfCauchy('residual', 1., shape = (1,n_observables))

        posterior = pm.Normal(  'posterior' \
                                , mu = estimator \
                                , sigma = residual \
                                , observed = data.loc[:, ['obs0', 'obs1']].values \
                              )
    else:
        # multivariate observables
        # with this, prediction FAILS

        # LKJ Prior
        sd_dist = pm.HalfNormal.dist(1.)
        # sd_dist = pm.Exponential.dist(1., shape = n_observables)
        # packed_cholesky, corr, stds = pm.LKJCholeskyCov(  'packed_cholesky' \
        packed_cholesky = pm.LKJCholeskyCov(  'packed_cholesky' \
                                              , n = n_observables \
                                              , eta = 1. \
                                              , sd_dist = sd_dist \
                                              , compute_corr = False \
                                              # , store_in_trace = True \
                                            )

        # compute the observables covariance and correlation matrix
        cholesky_matrix = pm.expand_packed_triangular(n_observables, packed_cholesky, lower = True)

        # posterior|likelihood|observables
        posterior = pm.MvNormal(  'posterior' \
                                , mu = estimator \
                                # , chol = packed_cholesky \
                                , chol = cholesky_matrix \
                                , observed = data.loc[:, ['obs0', 'obs1']].values \
                               )

    # inference
    trace = pm.sample(  draws = 2**10, tune = 2**10 \
                      # , cores = 4, chains = 4 \
                      , return_inferencedata = True \
                      )

# show outcome
# print (pm.summary(trace))
# pm.plot_trace(trace)
# plt.show()

### predictive sampling
print ('#### Prediction ####')

# CRITICAL: the desired number of predictions is different from the number of observations
n_predictions = 7 # THIS CRASHES
# ... or not (technical test)
n_predictions = n_observations # THIS WORKS

# adjust intercept shape
probe_data = {}
probe_data['intercept_data'] = np.ones((n_predictions, 1))

# how often to repeat the n_predictions draws
n_repeats = 10 # OBSOLETE see discussion before

with model:
    pm.set_data(probe_data)
    prediction = pm.sample_posterior_predictive(trace \
                                                # , samples = n_repeats \
                                                # , size = n_predictions \
                                                )

print()
# v5: the result is an arviz.InferenceData; the draws live in
# the "posterior_predictive" group as an xarray Dataset
print (list(prediction.posterior_predictive.keys()))
print (prediction.posterior_predictive['posterior'])


edit: I now realize that the “prediction” outcome has been changed to an xarray-based InferenceData, which causes secondary problems in my script. I removed the final plot of the prediction.

Thanks @falk for your reply. I came to the same conclusion as your comment: “if one uses set_data with something that has a different shape, one will get shape mismatch.” This makes sense but, at the same time, it is a bit strange. For instance, if I had 1M training samples, I would need to predict for another 1M test examples, and I couldn’t test on only, say, 1000 hold-out samples.

I see that your probe_data['intercept_data'] has the same size as the training dataset (9999).

Is this correct, or have I missed something?

That is correct; I set it that way so that the code actually runs through. If you comment out the second line starting with n_predictions =, then the example above will fail.

@RavinKumar note that sample dims should have nothing to do with it: my prior request centered around the now-obsolete samples/size keywords in pymc.sample_posterior_predictive(). Those are not used in this updated example.

Ah, that’s right. @ricardoV94 any idea? Is this something to do with the new shape handling in posterior predictive?

Usually this problem arises because one defined observed variables without an explicit shape that tells PyMC how the observed variable’s shape depends on a shared variable or parameter shape.

If not told anything, PyMC will default to shape=observed.shape.

The first example in the docstring of set_data shows how to provide an explicit shape that allows posterior predictive sampling with a new shape:

https://www.pymc.io/projects/docs/en/stable/api/generated/pymc.set_data.html
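
For illustration, here is a minimal sketch of that pattern, closely mirroring the docstring example rather than the model in this thread: tying the observed variable’s shape to the shared data lets it be resized for posterior predictive sampling.

import pymc as pm

with pm.Model() as m:
    x = pm.MutableData("x", [1.0, 2.0, 3.0])
    y = pm.MutableData("y", [1.0, 2.0, 3.0])
    beta = pm.Normal("beta", 0, 1)
    # shape=x.shape links the observed variable to the shared data,
    # so its length is not frozen at the training size
    obs = pm.Normal("obs", mu=x * beta, sigma=1.0, observed=y, shape=x.shape)
    idata = pm.sample()

with m:
    # resize to a different number of prediction points
    pm.set_data({"x": [5.0, 6.0, 9.0, 12.0, 15.0]})
    ppc = pm.sample_posterior_predictive(idata)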


@ricardoV94, thanks for your reply. However, my issue is due to the pm.Deterministic() part of the model, and that method does not accept shape. Is that correct? Can you see where I should add the shape?

You can’t add a shape to a Deterministic; a Deterministic only records the value of a variable, it doesn’t affect it.

If you have a shape problem in posterior_predictive, it means you are trying to resize some part of the graph that has been “frozen” because PyMC was not told how the shape should depend on a shared variable (MutableData). Usually this comes from the observed variable. If you need to also change the shape of unobserved variables, you are in trouble, because you can no longer use posterior predictive.

Deterministics will be automatically resampled if they depend on a variable that has changed shape (assuming you are using a recent enough version of PyMC).
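
As a toy illustration of both points (a sketch with made-up variables, not taken from this thread): a Deterministic that depends on a MutableData is recomputed when the data changes, while the free parameter keeps its posterior values.

import numpy as np
import pymc as pm

with pm.Model() as m:
    x = pm.MutableData("x", np.arange(3.0))
    mu = pm.Normal("mu", 0, 1)
    # a Deterministic takes no shape argument, but it follows x automatically
    d = pm.Deterministic("d", mu + x)
    y = pm.Normal("y", mu=d, sigma=1.0, shape=x.shape, observed=np.zeros(3))
    idata = pm.sample()

with m:
    pm.set_data({"x": np.arange(5.0)})  # new length
    # d is recomputed (mu still comes from the posterior);
    # y is resized because of shape=x.shape
    ppc = pm.sample_posterior_predictive(idata, var_names=["d", "y"])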

@ricardoV94, I’m using pymc '5.1.2', and it seems that my Deterministic variable is not resampled with the new shape it gets.

However, the same code works using '3.11.5'.

By any chance, do you see the issue in my code?

Do you have a minimal reproducible example I can run? Your original snippets seem to be truncated?

My hunch is that because you are creating a new model, and not using MutableData to store your fourier_features, PyMC does not know the shape has changed and does not resample fourier_coefs. You can either make those MutableData, or you can manually add fourier_coefs (and your observed variable) to var_names in sample_posterior_predictive to tell PyMC to ignore the values of seasonality stored in the trace (see the sketch below).
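
A rough sketch of that second option, reusing names from the snippet at the top of the thread (the likelihood is not shown in the truncated code, so "sales_obs" is a hypothetical placeholder for the model’s actual observed variable):

with model_factory(data=forecast_data, n_order=n_order) as forecast_model:
    # listing variables in var_names makes PyMC forward-sample them
    # instead of reusing the (wrongly shaped) values stored in the trace
    ppc = pm.sample_posterior_predictive(
        idata,
        var_names=["fourier_coefs", "seasonality", "sales_obs"],
    )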

Also, sorry for my previous advice; it was directed at the code shared by @falk. I didn’t notice there were two models being discussed in this issue.


@ricardoV94, thanks for your suggestion. Everything works now, after adding _fourier_features to a data container:

with pm.Model(coords=COORDS) as model:
    _t = pm.Data("t", t, mutable=True, dims="date")
    _date_idx = pm.Data("date_idx", date_idx, mutable=True, dims="obs")
    _store_idx = pm.Data("store_idx", store_idx, mutable=True, dims="obs")
    _item_idx = pm.Data("item_idx", item_idx, mutable=True, dims="obs")
    _fourier_features = pm.Data("fourier_features", fourier_features.to_numpy().T, mutable=True)
    sales = pm.Data("sales", data.sales, mutable=True, dims="obs")
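
Presumably the seasonality term then uses the shared variable instead of the raw NumPy array, along these lines (just a sketch; the rest of the model is unchanged):

    seasonality = pm.Deterministic(
        "seasonality",
        # the dot product now depends on a shared variable, so PyMC knows
        # to recompute it when the data (and hence the shape) is swapped out
        pm.math.dot(fourier_coefs, _fourier_features),
        dims="date",
    )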


Sorry also from my side for the confusion. I just dug up my “minimal example” from some time back, hoping it would be useful.
… And I think it still leaves some open questions; it seems the LKJ prior is not yet compatible with the new posterior predictive sampling? I regret that I have no time to investigate further. But I’m glad the OP’s issue is solved!
Cheers!

@falk No problem. Can you perhaps open a separate issue, so the two models don’t get tangled up? It should definitely be possible to do posterior predictive with LKJ models.

I’ll try when I find the time and get back to this (currently wrapping up my PhD thesis :wink:).
I must also check my old notes; maybe my code above was not my latest attempt.
But for years I seem to be the only one repeatedly bumping into LKJ and reporting back here, so this is certainly not urgent :slight_smile:
