Use distributions for new input in Data containers

Hello,

I am considering the following use case: I want to learn a function y = f_\theta(x) from historic data (model parameters \theta, features x, and target y). For a scenario analysis, I have created a new set of features x_{\text{new}} as a distribution (so it is a continuous version of a best-case vs. worst-case scenario). I would therefore like to feed in the new values x_{\text{new}} as a distribution and then predict f_\theta(x_{\text{new}}) together with confidence intervals, including the uncertainty both in the model parameters and in x_{\text{new}}.

If x_{\text{new}} were a normal array, I would use a Data container in the model and then swap in the new data for prediction. Is there a best practice for doing this with distributions? I did not find any material on this, so any hint in the right direction is highly appreciated.

The code at the end of this comment is probably pretty close to what you already have, but just to be sure, I've included it here as a reference for anyone else who wants to do this.

If x_{\text{new}} were a normal array, I would use a Data container in the model

Our assumptions about x_{\text{new}} shouldn't play into whether or not we use the Data container, because we would just be feeding actual draws of x_{\text{new}} into the model to predict. Regardless of how you choose the distribution over x, you should be able to sample from that distribution and stick the draws into pm.Data.
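For example, if your scenario distribution interpolates continuously between a worst case and a best case, the draws could look like the sketch below (x_best and x_worst are hypothetical stand-ins for your actual scenario endpoints):

import numpy as np

# Hypothetical best-case and worst-case feature vectors
x_best  = np.ones(10)
x_worst = -np.ones(10)

# One random mixing weight per draw gives a continuous interpolation
# between the two scenarios; each row of x_new is one scenario draw
n_noisy = 5
w       = np.random.rand(n_noisy, 1)
x_new   = w * x_best + (1.0 - w) * x_worst   # shape (n_noisy, 10)

Those rows can then go into pm.Data / pm.set_data exactly as in the code at the end.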

If you really do want to train on draws of x and predict using a distribution over x without putting draws into pm.Data, it might be worth pulling the parameter estimates out of the trace by hand and building a separate posterior predictive model, forcing its parameter values to be the estimates from the first step. I'm not sure whether there's a cleaner way to do it.
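After fitting the model below, that second idea could look roughly like this rough sketch. One big caveat: it collapses the posterior to point estimates (the means of coef and err_sd), so it understates parameter uncertainty; a more careful version would repeat the forward simulation for individual posterior draws instead.

# Point estimates pulled out of the trace by hand -- this throws away
# parameter uncertainty, see the caveat above
coef_hat   = trace['coef'].mean()
err_sd_hat = trace['err_sd'].mean()

with pm.Model() as pred_model:
    # x_new enters as a distribution rather than as observed data,
    # using the same scenario noise of 0.2 as below
    x_dist   = pm.Normal('x_dist', mu=x_new_mean, sigma=0.2, shape=10)
    response = pm.Normal('response', mu=coef_hat * x_dist,
                         sigma=err_sd_hat, shape=10)
    pred     = pm.sample_prior_predictive(samples=500)

response_samples_alt = pred['response']   # shape (500, 10)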

import numpy as np
import pymc3 as pm

# We assume that we have a linear regression with no intercept
# and a slope of 2. These 4 lines generate simulated data.
x    = np.random.randn(10)
beta = 2.0
eps  = np.random.randn(10) * 0.4
y    = beta*x + eps

# For posterior predictive sampling, we assume
# we won't have actual feature values but instead
# <n_noisy> draws around their mean
n_noisy    = 5
x_new_mean = np.random.randn(10)
x_new      = np.random.randn(n_noisy,10) * 0.2 + x_new_mean

# We fit a standard model using pm.Data to indicate that 
# we will be swapping data in and out later.
with pm.Model() as model:
    feature   = pm.Data('feature', x)
    coef      = pm.Normal('coef', sigma=2)
    err_sd    = pm.HalfCauchy('err_sd', beta=0.5)
    response  = pm.Normal('response', mu=feature*coef, sigma=err_sd, observed=y)
    trace = pm.sample()

# Next, we manually draw a new feature vector, tell PyMC3 to replace the old one in the
# graph, and draw the samples. This should be fast because of the way that the posterior
# predictive sampling function is generated + cached.
n_draws    = 50
pp_samples = []

# Here I assume sampling at random from the noisy predictors
for i in range(n_draws):
    idx = np.random.choice(n_noisy)
    
    with model:
        pm.set_data({"feature":x_new[idx]})
        pp_samples.append(pm.sample_posterior_predictive(trace=trace, samples=10, progressbar=False))

# Finally, we extract all of the individual response samples and stick them into one big array
response_samples = np.concatenate([d['response'] for d in pp_samples], axis=0)
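
To get the confidence intervals you asked about (combining the parameter uncertainty with the scenario noise in x_{\text{new}}), you can then just take percentiles over the pooled samples, e.g.

# 95% predictive interval per entry, pooled over parameter draws
# and scenario draws
lower, upper = np.percentile(response_samples, [2.5, 97.5], axis=0)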