How can I sample posteriors from training data, then run a posterior predictive on test data (with a complex factor graph)?

Is it feasible to define the model using both the training and test data when partitioning the data, then sample posteriors based solely on the training data, and finally run a posterior predictive on the test data? Or are we forced to explicitly retrieve the posteriors and re-map them for predictions?

By a complex mapping, I mean the mapping of certain random variables (RVs) to days, certain RVs to the day of the month, to months, etc. If new data is introduced, the model will not have the correct information to perform this mapping.

In this scenario, I can't see how to use set_data, since I need the correct parameters to be mapped to the new data.

Consider, e.g., the following simple example with a date-based mapping:

import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
import pandas as pd
import pytensor.tensor as tt

def generate_input_data(timerange):
    # One uniform draw in [0.5, 0.7] per timestamp, shaped (n, 1)
    rng = np.random.default_rng()
    return rng.uniform(0.5, 0.7, size=(len(timerange), 1))


def get_coefficient_val(date, coefficientknots):
    coefficientval = 0
    for knot, val in coefficientknots.items():
        lb, ub = map(int, knot.split("-"))
        if lb <= date.day <= ub:
            coefficientval = val
            break
    return coefficientval

def get_exponent_val(date, exponentknots):
    exponentval = 1.0
    for exponentknot, val in exponentknots.items():
        lb, ub = map(int, exponentknot.split("-"))
        if lb <= date.day <= ub:
            exponentval = val
            break
    return exponentval

def compute_true_values(timerange, coefficientknots, exponentknots):
    true_values = []
    for idx, date in enumerate(timerange):
        coefficientval = get_coefficient_val(date, coefficientknots)
        exponentval = get_exponent_val(date, exponentknots)
        if coefficientval == 0 or exponentval == 1.0:
            raise ValueError("Invalid coefficient or exponent value")
        val = generate_input_data([date])[0]
        entry = coefficientval * val ** exponentval
        true_values.append(entry)
    return np.array(true_values)

def run_model(timerange, coefficientknots, exponentknots, train_input, train_labels, test_input, test_labels):
    with pm.Model() as model:
        train_labels = pm.Data('convresponse', train_labels)
        train_input = pm.Data('input', train_input)

        coefficientknots_ = {}
        for interval in coefficientknots:
            knotcoefficient = pm.Normal(f'{interval}-knot', 0, 1)
            coefficientknots_[interval] = knotcoefficient

        exponentknots_ = {}
        for interval in exponentknots:
            knotexponent = pm.Gamma(f'{interval}-exponentknot', 3, 3)
            exponentknots_[interval] = knotexponent

        coefficients = []
        exponents = []
        for date in timerange:
            coefficientknot = next(val for key, val in coefficientknots_.items()
                                   if int(key.split("-")[0]) <= date.day <= int(key.split("-")[1]))
            exponentknot = next(val for key, val in exponentknots_.items()
                                if int(key.split("-")[0]) <= date.day <= int(key.split("-")[1]))

            coefficients.append(coefficientknot)
            exponents.append(exponentknot)

        coefficients = tt.as_tensor_variable(coefficients)
        exponents = tt.as_tensor_variable(exponents)

        modelledresponse = tt.power(coefficients * train_input, exponents)

        sigma = pm.HalfCauchy(name="sigma", beta=10)
        _ = pm.Normal(name="outcome",
                      mu=modelledresponse,
                      sigma=sigma,
                      observed=train_labels)

        trace = pm.sample()


def main():
    timerange = pd.date_range(start="2023-01-01", end="2023-03-24", freq='D')
    coefficientknots = {'0-15': 2.0, '15-32': 3.0}
    exponentknots = {'0-15': 0.9, '15-32': 0.3}
    input = generate_input_data(timerange)
    true_values = compute_true_values(timerange, coefficientknots, exponentknots)

    train_input = input[:140].squeeze()
    train_labels = true_values[:140].squeeze()

    test_input = input[140:].squeeze()
    test_labels = true_values[140:].squeeze()

    run_model(timerange, coefficientknots, exponentknots, train_input, train_labels, test_input, test_labels)


if __name__ == '__main__':
    main()

I need to map the test inputs to the correct dates over the horizon I wish to predict over.

I think it would be useful if you could provide a simple example where you’re struggling with set_data, as the question is quite abstract. In general, everything your model needs to get parameters to the right spot is data, so you need to either put it all into mutable data containers, or re-structure your data so you don’t need the indices.

I updated the question with a small example; the example is very simplistic, however. Based on your answer, I think we could restructure the data in some way. But say I have some term that models a carry-over effect between datapoints (see the sketch below); it can get really, really complex. Is there no way we could just run the whole "mapping" once over both the training and test data, then draw posteriors based on the training data and subsequently sample a posterior predictive based on the test data?
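For concreteness, here is a minimal sketch of the kind of carry-over term I mean: a geometric decay over past inputs, computed recursively with pytensor.scan (the decay parameter and step function are illustrative, not from my actual model):

import numpy as np
import pytensor
import pytensor.tensor as tt

def carry_over(x, decay):
    # carry[t] = x[t] + decay * carry[t-1]: each point inherits a
    # decayed contribution from the previous one
    def step(x_t, carry_tm1, decay):
        return x_t + decay * carry_tm1

    carry, _ = pytensor.scan(step,
                             sequences=[x],
                             outputs_info=[tt.zeros(())],
                             non_sequences=[decay])
    return carry

# Quick check on concrete numbers
x = tt.vector('x')
decay = tt.scalar('decay')
f = pytensor.function([x, decay], carry_over(x, decay))
print(f(np.array([1.0, 1.0, 1.0]), 0.5))  # [1.  1.5 1.75]

Because each output depends on the previous one, I can't see how to cleanly split a term like this at the train/test boundary.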

You can definitely run the whole mapping over everything, why not? You just then need to split the map into train and test parts, and include data objects for the map in your model. I see a lot of places where the example you posted can be simplified.

First, I organized the data into a DataFrame and used pd.cut to quickly get the bin indexes. The goal is to vectorize all the operations you're currently doing with loops. To do that, I want an integer array that maps each datapoint to its knot based on the calendar day (which we can grab using the .day property of the DatetimeIndex).

# I increased the timerange because it didn't actually make it to 140 observations
# in your example. I also renamed "input" to "data" -- a personal preference.

timerange = pd.date_range(start="2023-01-01", end="2023-07-24", freq='D')
coefficientknots = {'0-15': 2.0, '15-32': 3.0}
exponentknots = {'0-15': 0.9, '15-32': 0.3}
data = generate_input_data(timerange)
true_values = compute_true_values(timerange, coefficientknots, exponentknots)

# Gather stuff into a dataframe for organization
df = pd.DataFrame(np.c_[data, true_values], index=timerange, columns=['data', 'targets'])
df['day'] = df.index.day

# Create the map by assigning a bin index to each datapoint based on which day it falls into
knot_cutpoints = [0, 15, 32]
df['knot_idx'] = pd.cut(df['day'], knot_cutpoints, labels=False)

# Train-test split includes the knot mapping
df_train = df.iloc[:140].copy()
df_test = df.iloc[140:].copy()

The next step is to set up the coords for the model. It’s not strictly necessary, but highly recommended. It will make our outputs nicer and the code more readable.

# These names will be used to make the model output look nicer
knot_names = [f'{a}-{b}' for a, b in zip(knot_cutpoints[:-1], knot_cutpoints[1:])]

# Coords let us refer to dimension by their name, rather than having to remember their shapes
coords = {'knot': knot_names}

Next is the model. First, I changed usages of pm.Data to pm.MutableData, because we’re going to want to swap this data out and do out-of-sample prediction. Second, I renamed things to emphasize that they are arbitrary data containers that will hold more than just training data.

I use the dimension named in coords to draw multiple times from the same priors, eliminating the loops. Since I pre-computed the map from row to knot, I use fancy indexing to put the right coefficients with the right data.
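If the fancy-indexing step is unfamiliar, here's a minimal NumPy sketch of the idea (toy numbers): an integer array of row-to-knot assignments expands the small per-knot parameter vector into one value per row.

import numpy as np

knot_coef = np.array([2.0, 3.0])      # one parameter per knot
knot_idx = np.array([0, 0, 1, 1, 0])  # which knot each row belongs to

print(knot_coef[knot_idx])  # [2. 2. 3. 3. 2.]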

Finally, note the line labeled "Important!". It's good practice to link the shape of the outcome variable to the shape of the input data. This prevents shape errors when you do pm.set_data on the input but not the output (i.e., when you have no output, as in out-of-sample prediction). More on this here.

with pm.Model(coords=coords) as model:
    data_pt = pm.MutableData('input', df_train.data)
    targets_pt = pm.MutableData('convresponse', df_train.targets)
    
    # Here is the mapping; when we do out-of-sample prediction we'll update this as well
    knot_idx_pt = pm.MutableData('knot_idx', df_train.knot_idx)

    # Using coords, draw all the parameters for all knots at once
    knot_coef = pm.Normal('knot_coef', 0, 1, dims=['knot'])
    knot_exponent = pm.Gamma('knot_exponent', 3, 3, dims=['knot'])
    
    # Use the mapping defined above to assign the correct coefficient to the 
    # correct observation
    modelledresponse = tt.power(knot_coef[knot_idx_pt] * data_pt, 
                                knot_exponent[knot_idx_pt])

    sigma = pm.HalfCauchy(name="sigma", beta=10)
    outcome = pm.Normal(name="outcome",
                  mu=modelledresponse,
                  sigma=sigma,
                  observed=targets_pt,
                        shape = data_pt.shape[0], # Important!
                       initval={'knot_exponent':[1, 1]}) # I found sometimes the exponents blow up

    idata = pm.sample()

Finally, do prediction on the test set using pm.set_data

with model:
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)
    pm.set_data({'input':df_test.data, 'knot_idx':df_test.knot_idx})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)

Here's what I get:

[plot of the in-sample and out-of-sample posterior predictive omitted]


I love this community… amazing response. 5/5

I edited some small mistakes, but glad to help.

I understand your real model is much more complex, but hopefully this illustrates the principle. For each row of your data, make an integer array that maps it to the coefficient dimension you care about (in this case, which day-knot goes to which row), then include a pm.MutableData for the integer array, which you will use to fancy index something somewhere in the model.

1 Like

I have begun implementing the changes you suggested in my original code, but I have encountered a problem that I am struggling to resolve elegantly. I have a solution mapped out in my head, but I believe there may be a far more efficient approach. Additionally, indexing PyTensor tensors, specifically the random variables, is a new concept to me, and I find it challenging to grasp even though I have read the docs.

To further illustrate the issue, I have modified your provided code and introduced new variables to demonstrate the problem. In this instance, we have random variables that correspond to specific dates. Consequently, we need a method for indexing subsets of the training data and concatenating or appending them to the non-special cases, or alternatively a more sophisticated solution. As the number of special cases increases, the indexing becomes increasingly complicated.

I would appreciate any guidance on how to tackle this issue, or if possible, a reference to any related work that could provide me with insights on this topic.

Thanks in advance.

def main():
    # I increased the timerange because it didn't actually make it to 140 observations
    # in your example. I also renamed "input" to "data" -- a personal preference.

    timerange = pd.date_range(start="2023-01-01", end="2023-12-31", freq='D')
    coefficientknots = {'0-15': 2.0, '15-32': 3.0}
    exponentknots = {'0-10': 0.9, '10-32': 0.3}

    specialdates = True
    if specialdates:
        coefficients_special_date_knots = {'0-24': 6.0, '24-32': 4.0}
        exponent_special_date_knots = {'0-19': 0.95, '19-32': 0.6}
    else:
        coefficients_special_date_knots = {}
        exponent_special_date_knots = {}

    data = generate_input_data(timerange)
    true_values = compute_true_values(timerange, coefficientknots, exponentknots,
                                      coefficients_special_date_knots, exponent_special_date_knots)

    # Gather stuff into a dataframe for organization
    df = pd.DataFrame(np.c_[data, true_values], index=timerange, columns=['data', 'targets'])
    df['day'] = df.index.day

    mask = (df.index.month == 12) & (df.index.day >= 0) & (df.index.day <= 32)
    
    # Create the map by assigning a bin index to each datapoint based on which day it falls into
    coefficients_knot_cutpoints = [0, 15, 32]
    df['coefficient_knot_idx_normal'] = pd.cut(df['day'], coefficients_knot_cutpoints, labels=False)
    df.loc[mask, 'coefficient_knot_idx_normal'] = -1
    
    exponent_knot_cutpoints = [0, 10, 32]
    df['exponent_knot_idx_normal'] = pd.cut(df['day'], exponent_knot_cutpoints, labels=False)
    df.loc[mask, 'exponent_knot_idx_normal'] = -1

    coefficient_knot_names = [f'{a}-{b}' for a, b in zip(coefficients_knot_cutpoints[:-1],
                                                         coefficients_knot_cutpoints[1:])]

    # These names will be used to make the model output look nicer
    exponent_knot_names = [f'{a}-{b}' for a, b in zip(exponent_knot_cutpoints[:-1],
                                                      exponent_knot_cutpoints[1:])]

    exponent_knot_cutpoints_special = [0, 19, 32]
    exponent_knot_names_special = [f'{a}-{b}' for a, b in zip(exponent_knot_cutpoints_special[:-1],
                                                              exponent_knot_cutpoints_special[1:])]
    df['exponent_knot_idx_special'] = -1
    df.loc[mask, 'exponent_knot_idx_special'] = pd.cut(df.loc[mask, 'day'],
                                                       exponent_knot_cutpoints_special,
                                                       labels=False)

    coefficients_knot_cutpoints_special = [0, 24, 32]
    coefficients_knot_names_special = [f'{a}-{b}' for a, b in zip(coefficients_knot_cutpoints_special[:-1],
                                                                  coefficients_knot_cutpoints_special[1:])]
    df['coefficient_knot_idx_special'] = -1
    df.loc[mask, 'coefficient_knot_idx_special'] = pd.cut(df.loc[mask, 'day'],
                                                          coefficients_knot_cutpoints_special,
                                                          labels=False)

    # Coords let us refer to dimension by their name, rather than having to remember their shapes
    coords = {'exponent_knot_normal': exponent_knot_names,
              'exponent_knot_special': exponent_knot_names_special,
              'coefficient_knot_normal': coefficient_knot_names,
              'coefficient_knot_special': coefficients_knot_names_special}

    # Train-test split includes the knot mapping
    df_train = df.iloc[:140].copy()
    df_test = df.iloc[140:].copy()


    with pm.Model(coords=coords) as model:
        data_pt = pm.MutableData('input', df_train.data)
        targets_pt = pm.MutableData(f'convresponse', df_train.targets)

        # Here is the mapping; when we do out-of-sample prediction we'll update this as well
        coefficient_knot_idx_normal_pt = pm.MutableData('coefficient_knot_idx_normal', df_train.coefficient_knot_idx_normal)
        exponent_knot_idx_normal_pt = pm.MutableData('exponent_knot_idx_normal', df_train.exponent_knot_idx_normal)

        coefficient_knot_idx_special_pt = pm.MutableData('coefficient_knot_idx_special',
                                                        df_train.coefficient_knot_idx_special)
        exponent_knot_idx_special_pt = pm.MutableData('exponent_knot_idx_special', df_train.exponent_knot_idx_special)

        # Using coords, draw all the parameters for all knots at once
        knot_coef_normal = pm.Normal('knot_coef', 0, 1, dims=['coefficient_knot_normal'])
        knot_exponent_normal = pm.Gamma('knot_exponent', 3, 3, dims=['exponent_knot_normal'])

        # Separate priors for the special (December) knots
        knot_coef_special = pm.Normal('knot_coef_special', 0, 3, dims=['coefficient_knot_special'])
        knot_exponent_special = pm.Gamma('knot_exponent_special', 1, 3, dims=['exponent_knot_special'])

        # Concatenate the two above in a smart way, and fix the mapping accordingly?
        # Assuming we have even more such specials, I think there could be a smarter way?
        
        modelledresponse = tt.power(knot_coef_normal[coefficient_knot_idx_normal_pt] * data_pt,
                                    knot_exponent_normal[exponent_knot_idx_normal_pt])

        sigma = pm.HalfCauchy(name="sigma", beta=10)
        outcome = pm.Normal(name="outcome",
                            mu=modelledresponse,
                            sigma=sigma,
                            observed=targets_pt,
                            shape=data_pt.shape[0],  # Important!
                            initval={'knot_exponent': [1, 1]})  # I found sometimes the exponents blow up

        idata = pm.sample()

    with model:
        idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)
        pm.set_data({'input': df_test.data,
                     'coefficient_knot_idx_normal': df_test.coefficient_knot_idx_normal,
                     'exponent_knot_idx_normal': df_test.exponent_knot_idx_normal,
                     'coefficient_knot_idx_special': df_test.coefficient_knot_idx_special,
                     'exponent_knot_idx_special': df_test.exponent_knot_idx_special})
        idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)


if __name__ == '__main__':
    main()

I see two issues mixed up in the question, one related to indexing and one related to modeling.

For indexing, it seems like you mostly have everything well in hand. All you are really doing here is making indicator features for the different types of dates you want to highlight. I agree the fancy-indexing method takes some getting used to. If you're more used to the dummy-variables approach, you can try some experiments to convince yourself the two are exactly the same (see the sketch below). As with everything related to Bayesian statistics, there's a nice Richard McElreath lecture that goes over what indicator variables are and how they are used. Perhaps it will be useful?
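For example, here's a quick NumPy experiment (toy numbers) showing that fancy indexing and a dummy-variable (one-hot) matrix product give identical results:

import numpy as np

coefs = np.array([2.0, 3.0])
idx = np.array([0, 1, 1, 0])

onehot = np.eye(2)[idx]  # dummy-variable version of the same mapping

print(coefs[idx])      # [2. 3. 3. 2.]
print(onehot @ coefs)  # [2. 3. 3. 2.] -- identical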

This is essentially feature engineering, and there’s not really any getting away from it, index variables or no.

Where I think you are going wrong is with the -1 assignment to December (the day >= 0 & day <= 32 conditions are not needed, by the way, since all days fall in this range). This isn't going to make the model ignore December; it's going to grab the last element of the parameter vector and give it to December (think about what ndarray[-1] signifies).

If you want December to have a different parameter from all the rest, you can make an offset parameter and add it to the base coefficient. Something like:

(knot_coef_normal[coef_idx_normal] + december_knot_coef * is_december)  * data_pt

Where is_december is a vector of 1's and 0's. Here I use multiplication with an indicator vector because you don't want any effect on the non-December knots, so you can't really use a lookup (well, you could, but you'd have to concatenate a zero to the coefficients, and I think that's more trouble than it's worth).
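Mechanically, here's what that offset-plus-indicator computation looks like in plain NumPy (toy numbers; in the model, is_december would live in its own pm.MutableData container so it can be swapped out at prediction time):

import numpy as np

base = np.array([2.0, 3.0])                # per-knot coefficients
knot_idx = np.array([0, 1, 0, 1, 0])       # row -> knot map
is_december = np.array([0, 0, 0, 1, 1.0])  # indicator for December rows
december_offset = 0.5                      # extra effect for December

coef_per_row = base[knot_idx] + december_offset * is_december
print(coef_per_row)  # [2.  3.  2.  3.5 2.5]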

This segues into the modeling part. You have to decide for yourself how things are going to go together. In the example above, I assumed the coefficients combine linearly, so that December is the same as everywhere else, plus or minus some offset. You can carry on like this for as many effects as you want. You could also entertain some other functional form, but linear is always the best place to start. Given the code you posted, you might try:

        base_coef = knot_coef_normal[coefficient_knot_idx_normal_pt] + knot_coef_special[coefficient_knot_idx_special_pt]
        exp_coef = knot_exponent_normal[exponent_knot_idx_normal_pt] + knot_exponent_special[exponent_knot_idx_special_pt]

        modelledresponse = tt.power(base_coef * data_pt, exp_coef)

Where ultimately base_coef and exp_coef could include as many additive terms as you wish.


Hiya, could you please provide the code you used for plotting the predictions? It seems these plots changed a bit since PyMC3. Did you sort it out manually (.mean() per chain to get from 3D to 2D, then plot), or do we have access to an az function that produces this directly? I couldn't find any in the docs besides az.plot_ppc, but it's a bit different…

I didn't use arviz. It's the same as the code I posted here:
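For reference, a minimal matplotlib sketch of that kind of plot, assuming the idata and dataframes from the model above (posterior predictive means taken over the chain and draw dimensions; this is a sketch of the idea, not the exact code from the linked post):

import matplotlib.pyplot as plt

# In-sample draws live in idata.posterior_predictive; out-of-sample
# draws (from predictions=True) live in idata.predictions
in_sample = idata.posterior_predictive['outcome'].mean(dim=['chain', 'draw'])
out_sample = idata.predictions['outcome'].mean(dim=['chain', 'draw'])

fig, ax = plt.subplots()
ax.plot(df_train.index, df_train.targets, 'k.', label='observed')
ax.plot(df_train.index, in_sample, label='in-sample predictive mean')
ax.plot(df_test.index, out_sample, label='out-of-sample predictive mean')
ax.legend()
plt.show()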
