How can i sample posteriors from training-data, run posterior predictive on testing data (with complex factor graph)

hughlaughter · April 12, 2023, 7:46am

Is it feasible to define the model by utilizing both training and test data when partitioning the data, and subsequently sample posteriors based solely on the training data, followed by running a posterior predictive on the testing data? Alternatively, are we compelled to explicitly retrieve the posteriors and re-map them for predictions?

By referring to a complex mapping, I mean the mapping of certain random variables (RVs) to days, certain RVs to the day of the month, months, etc. If new data is introduced, the modelwill not have the correct information to perform this mapping.

In this scenario, I sense that i am unable to determine how to utilize set_data, as I require the correct parameters to be mapped to the new data.

Consider e.g the following simple example with a datebased mapping:

import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
import pandas as pd
import pytensor.tensor as tt

def generate_input_data(timerange):
    import random
    return np.array([random.uniform(0.5, 0.7) for x in range(len(timerange))]).reshape(-1, 1)


def get_coefficient_val(date, coefficientknots):
    coefficientval = 0
    for knot, val in coefficientknots.items():
        lb, ub = map(int, knot.split("-"))
        if lb <= date.day <= ub:
            coefficientval = val
            break
    return coefficientval

def get_exponent_val(date, exponentknots):
    exponentval = 1.0
    for exponentknot, val in exponentknots.items():
        lb, ub = map(int, exponentknot.split("-"))
        if lb <= date.day <= ub:
            exponentval = val
            break
    return exponentval

def compute_true_values(timerange, coefficientknots, exponentknots):
    true_values = []
    for idx, date in enumerate(timerange):
        coefficientval = get_coefficient_val(date, coefficientknots)
        exponentval = get_exponent_val(date, exponentknots)
        if coefficientval == 0 or exponentval == 1.0:
            raise ValueError("Invalid coefficient or exponent value")
        val = generate_input_data([date])[0]
        entry = coefficientval * val ** exponentval
        true_values.append(entry)
    return np.array(true_values)

def run_model(timerange, coefficientknots, exponentknots, train_input, train_labels, test_input, test_labels):
    with pm.Model() as model:
        train_labels = pm.Data(f'convresponse', train_labels)
        train_input = pm.Data('input', train_input)

        coefficientknots_ = {}
        for interval in coefficientknots:
            knotcoefficient = pm.Normal(f'{interval}-knot', 0, 1)
            coefficientknots_[interval] = knotcoefficient

        exponentknots_ = {}
        for interval in exponentknots:
            knotexponent = pm.Gamma(f'{interval}-exponentknot', 3, 3)
            exponentknots_[interval] = knotexponent

        coefficients = []
        exponents = []
        for date in timerange:
            coefficientknot = next(val for key, val in coefficientknots_.items()
                                   if int(key.split("-")[0]) <= date.day <= int(key.split("-")[1]))
            exponentknot = next(val for key, val in exponentknots_.items()
                                if int(key.split("-")[0]) <= date.day <= int(key.split("-")[1]))

            coefficients.append(coefficientknot)
            exponents.append(exponentknot)

        coefficients = tt.as_tensor_variable(coefficients)
        exponents = tt.as_tensor_variable(exponents)

        modelledresponse = np.power(coefficients * train_input, exponents)


        sigma = pm.HalfCauchy(name="sigma", beta=10)
        _ = pm.Normal(name="outcome",
                      mu=modelledresponse,
                      sigma=sigma,
                      observed=train_labels)

        trace = pm.sample()


def main():
    timerange = pd.date_range(start="2023-01-01", end="2023-03-24", freq='D')
    coefficientknots = {'0-15': 2.0, '15-32': 3.0}
    exponentknots = {'0-15': 0.9, '15-32': 0.3}
    input = generate_input_data(timerange)
    true_values = compute_true_values(timerange, coefficientknots, exponentknots)

    train_input = input[:140].squeeze()
    train_labels = true_values[:140].squeeze()

    test_input = input[140:].squeeze()
    test_labels = input[140:].squeeze()

    run_model(timerange, coefficientknots, exponentknots, train_input, train_labels, test_input, test_labels)


if __name__ == '__main__':
    main()

I need to map the test_inputs to the correct dates over the horizon which i wish to predict over.

jessegrabowski · April 12, 2023, 10:51am

I think it would be useful if you could provide a simple example where you’re struggling with set_data, as the question is quite abstract. In general, everything your model needs to get parameters to the right spot is data, so you need to either put it all into mutable data containers, or re-structure your data so you don’t need the indices.

hughlaughter · April 13, 2023, 3:11am

updated the question with a small example, the example is very simplistic however. I think based on your answer however that we could restructure the data in some sort of way but lets say i have some term that modelles a carry-over effect between datapoints, it can get really really complex, there is no way we could just run the whole “mapping” once over both the train and testing data and then just draw posteriors based on the training-data and subsequently sample a posterior predictive based on the testingdata?

jessegrabowski · April 13, 2023, 10:33am

You can definitely run the whole mapping over everything, why not? You just then need to split the map into train and test parts, and include data objects for the map in your model. I see a lot of places where the example you posted can be simplified.

First, I organized the data into a DataFarme, and used pd.cut to quickly get the bin indexes. The goal is to vectorize all the operations you’re currently doing with loops. To do that, I want an integer array that maps each datapoint to it’s knot based on the calendar day (which we can grab using the .day property of the DateTimeIndex)

# I increased the timerange because it didn't actually make it to 140 observations
# in your example. I also renamed "input" to "data" -- a personal preference.

timerange = pd.date_range(start="2023-01-01", end="2023-07-24", freq='D')
coefficientknots = {'0-15': 2.0, '15-32': 3.0}
exponentknots = {'0-15': 0.9, '15-32': 0.3}
data = generate_input_data(timerange)
true_values = compute_true_values(timerange, coefficientknots, exponentknots)

# Gather stuff into a dataframe for organization
df = pd.DataFrame(np.c_[data, true_values], index=timerange, columns=['data', 'targets'])
df['day'] = df.index.day

# Create the map by assiging a bin index to each datapoint based on which day it falls into
knot_cutpoints = [0, 15, 32]
df['knot_idx'] = pd.cut(df['day'], knot_cutpoints, labels=False)

# Train-test split includes the knot mapping
df_train = df.iloc[:140].copy()
df_test = df.iloc[140:].copy()

The next step is to set up the coords for the model. It’s not strictly necessary, but highly recommended. It will make our outputs nicer and the code more readable.

# These names will be used to make the model output look nicer
knot_names = [f'{a}-{b}' for a, b in zip(knot_cutpoints[:-1], knot_cutpoints[1:])]

# Coords let us refer to dimension by their name, rather than having to remember their shapes
coords = {'knot':knot_names}

Next is the model. First, I changed usages of pm.Data to pm.MutableData, because we’re going to want to swap this data out and do out-of-sample prediction. Second, I renamed things to emphasize that they are arbitrary data containers that will hold more than just training data.

I use the dimension named in coords to draw multiple times from the same priors, eliminating the loops. Since I pre-computed the map from row to knot, I use fancy indexing to put the right coefficients with the right data.

Finally is the line I labeled “Important!”. It’s good practice to link the shape of the outcome variable to the shape of the input data. This prevents shape errors when you do pm.set_data on the input but not the output (i.e. when you have no output, i.e. for out-of-sample prediction). More on this here.

with pm.Model(coords=coords) as model:
    data_pt = pm.MutableData('input', df_train.data)
    targets_pt = pm.MutableData(f'convresponse', df_train.targets)
    
    # Here is the mapping; when we do out-of-sample prediction we'll update this as well
    knot_idx_pt = pm.MutableData('knot_idx', df_train.knot_idx)

    # Using coords, draw all the parameters for all knots at once
    knot_coef = pm.Normal(f'knot_coef', 0, 1, dims=['knot'])
    knot_exponent = pm.Gamma(f'knot_exponent', 3, 3, dims=['knot'])
    
    # Use the mapping defined above to assign the correct coefficient to the 
    # correct observation
    modelledresponse = tt.power(knot_coef[knot_idx_pt] * data_pt, 
                                knot_exponent[knot_idx_pt])

    sigma = pm.HalfCauchy(name="sigma", beta=10)
    outcome = pm.Normal(name="outcome",
                  mu=modelledresponse,
                  sigma=sigma,
                  observed=targets_pt,
                        shape = data_pt.shape[0], # Important!
                       initval={'knot_exponent':[1, 1]}) # I found sometimes the exponents blow up

    idata = pm.sample()

Finally, do prediction on the test set using pm.set_data

with model:
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)
    pm.set_data({'input':df_test.data, 'knot_idx':df_test.knot_idx})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)

Here’s what I get:

hughlaughter · April 13, 2023, 10:35am

i love this community… amazing response. 5/5

jessegrabowski · April 13, 2023, 10:42am

I edited some small mistakes, but glad to help.

I understand your real model is much more complex, but hopefully this illustrates the principle. For each row of your data, make an integer array that maps it to the coefficient dimension you care about (in this case, which day-knot goes to which row), then include a pm.MutableData for the integer array, which you will use to fancy index something somewhere in the model.

hughlaughter · April 20, 2023, 2:35pm

I have begun implementing the changes you suggested in my original code, but I have encountered a problem that I am struggling to resolve elegantly. I have found a solution mapped up in my brain but I believe there may be a way more efficient approach. Additionally, the indexing of theano tensors, specifically the random variables, is a new concept to me, and I find it challenging to comprehend even though i have read the docs.

To further illustrate the issue, I have modified your provided code and introduced new variables to demonstrate the problem. In this instance, we have random variables that correspond to specific dates. Consequently, we require a method for indexing subsets of the training data and concatenating or appending them to the non-special cases, or alternatively, devising a more sophisticated solution. As the number of special cases increases, the indexing process becomes increasingly complicated.

I would appreciate any guidance on how to tackle this issue, or if possible, a reference to any related work that could provide me with insights on this topic.

thanks in advance.

def main():
    # I increased the timerange because it didn't actually make it to 140 observations
    # in your example. I also renamed "input" to "data" -- a personal preference.

    timerange = pd.date_range(start="2023-01-01", end="2023-12-31", freq='D')
    coefficientknots = {'0-15': 2.0, '15-32': 3.0}
    exponentknots = {'0-10': 0.9, '10-32': 0.3}

    specialdates = True
    if specialdates:
        coefficients_special_date_knots = {'0-24': 6.0, '24-32': 4.0}
        exponent_special_date_knots = {'0-19': 0.95, '19-32': 0.6}
    else:
        coefficients_special_date_knots = {}
        exponent_special_date_knots = {}

    data = generate_input_data(timerange)
    true_values = compute_true_values(timerange, coefficientknots, exponentknots,
                                      coefficients_special_date_knots, exponent_special_date_knots)

    # Gather stuff into a dataframe for organization
    df = pd.DataFrame(np.c_[data, true_values], index=timerange, columns=['data', 'targets'])
    df['day'] = df.index.day

    mask = (df.index.month == 12) & (df.index.day >= 0) & (df.index.day <= 32)
    
    # Create the map by assiging a bin index to each datapoint based on which day it falls into
    coefficients_knot_cutpoints = [0, 15, 32]
    df['coefficient_knot_idx_normal'] = pd.cut(df['day'], coefficients_knot_cutpoints, labels=False)
    df.loc[mask, 'coefficient_knot_idx_normal'] = -1
    
    exponent_knot_cutpoints = [0, 10, 32]
    df['exponent_knot_idx_normal'] = pd.cut(df['day'], exponent_knot_cutpoints, labels=False)
    df.loc[mask, 'exponent_knot_idx_normal'] = -1

    coefficient_knot_names = [f'{a}-{b}' for a, b in zip(coefficients_knot_cutpoints[:-1],
                                                         coefficients_knot_cutpoints[1:])]

    # These names will be used to make the model output look nicer
    exponent_knot_names = [f'{a}-{b}' for a, b in zip(exponent_knot_cutpoints[:-1],
                                                      exponent_knot_cutpoints[1:])]

    exponent_knot_cutpoints_special = [0, 24, 32]
    exponent_knot_names_special = [f'{a}-{b}' for a, b in zip(exponent_knot_cutpoints_special[:-1],
                                                              exponent_knot_cutpoints_special[1:])]
    df['exponent_knot_idx_special'] = -1
    df.loc[mask, 'exponent_knot_idx_special'] = pd.cut(df.loc[mask, 'day'],
                                                       exponent_knot_cutpoints_special,
                                                       labels=False)

    coefficients_knot_cutpoints_special = [0, 19, 32]
    coefficients_knot_names_special = [f'{a}-{b}' for a, b in zip(coefficients_knot_cutpoints_special[:-1],
                                                                  coefficients_knot_cutpoints_special[1:])]
    df['coefficient_knot_idx_special'] = -1
    df.loc[mask, 'coefficient_knot_idx_special'] = pd.cut(df.loc[mask, 'day'],
                                                          coefficients_knot_cutpoints_special,
                                                          labels=False)

    # Coords let us refer to dimension by their name, rather than having to remember their shapes
    coords = {'exponent_knot_normal': exponent_knot_names,
              'exponent_knot_special': exponent_knot_names_special,
              'coefficient_knot_normal': coefficient_knot_names,
              'coefficient_knot_special': coefficients_knot_names_special}

    # Train-test split includes the knot mapping
    df_train = df.iloc[:140].copy()
    df_test = df.iloc[140:].copy()


    with pm.Model(coords=coords) as model:
        data_pt = pm.MutableData('input', df_train.data)
        targets_pt = pm.MutableData(f'convresponse', df_train.targets)

        # Here is the mapping; when we do out-of-sample prediction we'll update this as well
        coefficient_knot_idx_normal_pt = pm.MutableData('coefficient_knot_idx_normal', df_train.coefficient_knot_idx_normal)
        exponent_knot_idx_normal_pt = pm.MutableData('exponent_knot_idx_normal', df_train.exponent_knot_idx_normal)

        coefficient_knot_idx_special_pt = pm.MutableData('coefficient_knot_idx_special',
                                                        df_train.coefficient_knot_idx_special)
        exponent_knot_idx_special_pt = pm.MutableData('exponent_knot_idx_special', df_train.exponent_knot_idx_special)

        # Using coords, draw all the parameters for all knots at once
        knot_coef_normal = pm.Normal(f'knot_coef', 0, 1, dims=['coefficient_knot_normal'])
        knot_exponent_normal = pm.Gamma(f'knot_exponent', 3, 3, dims=['exponent_knot_normal'])

        # Using coords, draw all the parameters for all knots at once
        knot_coef_special = pm.Normal(f'knot_coef_special', 0, 3, dims=['coefficient_knot_special'])
        knot_exponent_special = pm.Gamma(f'knot_exponent_special', 1, 3, dims=['exponent_knot_special'])

        # concatenate the two above in a smart way?
        # and fix the mapping accordingly? ->
        # assuming we have even more such specials i think there could be a smarter way?
        
        modelledresponse = tt.power(knot_coef_normal[coefficient_knot_idx_normal_pt] * data_pt,
                                    knot_exponent_normal[exponent_knot_idx_normal_pt])

        sigma = pm.HalfCauchy(name="sigma", beta=10)
        outcome = pm.Normal(name="outcome",
                            mu=modelledresponse,
                            sigma=sigma,
                            observed=targets_pt,
                            shape=data_pt.shape[0],  # Important!
                            initval={'knot_exponent': [1, 1]})  # I found sometimes the exponents blow up

        idata = pm.sample()

    with model:
        idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)
        pm.set_data({'input': df_test.data, 'knot_idx': df_test.knot_idx})
        idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)


if __name__ == '__main__':
    main()

jessegrabowski · April 20, 2023, 3:28pm

I see two issues mixed up in the question, one related to indexing and one related to modeling.

For indexing, it seems like you mostly have everything well in hand. Actually all you are doing here is making indicator features for all the different types of dates you want to highlight. I agree the fancy-indexing method takes some getting used to. If you’re more used to the dummy variables approach, you can try some experiments to convince yourself the two are exactly the same. As with everything related to Bayesian statistics, there’s a nice Richard McElreath lecture that goes over what indicator variables are and how they are used. Perhaps it will be useful?

This is essentially feature engineering, and there’s not really any getting away from it, index variables or no.

Where I think you are going wrong is with the -1 assignment to December (day > 0 & day < 32 is not needed, since all days are in this range). This isn’t going to make the model ignore December, it’s going to grab the last element of the parameter vector and give it to December (think about what ndarray[-1] signifies).

If you want December to have a different parameter from all the rest, you can make a offset parameter and add it to the base coefficient. Something like:

(knot_coef_normal[coef_idx_normal] + december_knot_coef * is_december)  * data_pt

Where is_december is a vector of 1’s and 0’s. Here I use multiplication with an indicator vector because you don’t want any effect on the non-december knots, so you can’t really use a lookup (well you could, but you’d have to concatenate a zero to the coefficients and I think that’s more trouble than it’s worth).

This segues into modeling part. You have to decide yourself how things are going to go together. In the above example, I assumed the coefficients combine linearly, so that December is the same as everyone else, plus or minus some offset. You can carry on like this for as many effects as you want. You could also entertain some other functional form, but linear is always the best place to start. Given the code you posted, you might try:

        base_coef = knot_coef_normal[coefficient_knot_idx_normal_pt]+ knot_coef_special[coefficient_knot_idx_special_pt]
        exp_coef = knot_exponent_normal[exponent_knot_idx_normal_pt] + knot_exponent_special[exponent_knot_idx_special_pt]

        modelledresponse = tt.power(base_coef * data_pt, exp_coef)

Where ultimately base_coef and exp_coef could include as many additive terms as you wish.

hughlaughter · April 23, 2023, 10:22am

hiya, could you please provide the code you used for plotting the predictions, seems like these plots changed a bit since pymc3, did you manually sort it(.mean() per chain to get it from 3d-2d and then plotting) out or do we have access to a az function that produces this directly, couldnt find any in the docs beside az.plot_ppc but its a bit different…

jessegrabowski · April 23, 2023, 11:05am

I didn’t use arviz. It’s the same as the code i posted here:

Topic		Replies	Views
Get samples of posterior on new data mapped with set_data	6	184	November 15, 2023
Get predictions from a model using the MAP values instead of sampling the posterior v5 development , modeling	1	412	October 5, 2023
How to make out-of-sample predictions with pymc model v5	1	476	February 8, 2023
Could somebody provide a minimal example for sample_posterior_predictive() Questions	2	339	April 15, 2021
Making predictions on model inputs based on observed outputs Questions	3	645	March 18, 2021

How can i sample posteriors from training-data, run posterior predictive on testing data (with complex factor graph)

Related Topics