PyMC5 out-of-sample prediction

I’m migrating my code from PyMC3 to PyMC5, but I’m having some problems adapting out-of-sample prediction.

I have the code below:

features = ['f1', 'f2']
target = df['target'].values
basic_model = pm.Model()

with basic_model:
    response_mean = []
    
    for f in features:
        ff = df[f].values
        beta = pm.HalfNormal(f'coeff_{f}', sigma=2)
        decay = pm.Beta(f'decay_{f}', alpha=3, beta=3)
        contribution = pm.Deterministic(f'contribution_{f}', ff * decay * beta)
        response_mean.append(contribution)
                    
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    likelihood = pm.Normal("likelihood", mu=sum(response_mean), sigma=sigma, observed=target)

with basic_model:
    idata = pm.sample(return_inferencedata=True, tune=1000)

Now I would like to run out-of-sample predictions starting from new values of f1 and f2. Looking at the documentation ([Out-Of-Sample Predictions — PyMC example gallery][1]), I have to use data containers (pm.MutableData), but I am not able to integrate those containers into my code. Can anyone give me a suggestion?

Since you just have a linear model, there’s no need to have a loop over the features. It’s best practice to vectorize everywhere you can. In this case, it also has the advantage of making it more apparent how to use the data containers. Here is your model using broadcast multiplication across data containers:

coords = {'feature':['feature_1', 'feature_2']}

basic_model = pm.Model(coords=coords)

with basic_model:
    X = pm.MutableData('X', X_train.values)
    y = pm.MutableData('y', y_train.values)
    
    betas = pm.HalfNormal('coeff', sigma=2, dims=['feature'])
    decay = pm.Beta('decay', alpha=3, beta=3, dims=['feature'])
    contribution = pm.Deterministic('contribution', X * decay * betas, dims=[None, 'feature'])
    
    mu = contribution.sum(axis=-1)
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    obs = pm.Normal("y_hat", mu=mu, sigma=sigma, observed=y, shape=X.shape[0])

with basic_model:
    idata = pm.sample(init='jitter+adapt_diag_grad')
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)

Note the shape=X.shape[0] argument in the likelihood term. Now you can get predictions for your out-of-sample values:

with basic_model:
    pm.set_data({'X':X_test, 'y':y_test})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)

And make a KDE plot. The PPC looks good in this case, since the test values come from the same distribution as the training values (I just generated some samples and split them with sklearn.model_selection.train_test_split). I had to hack the plot together because ArviZ doesn’t support using the predictions group in az.plot_ppc.
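
For reference, the synthetic data was generated along these lines (a hypothetical sketch, the exact generating values don’t matter):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical data-generating process matching the model structure
rng = np.random.default_rng(0)
X_all = pd.DataFrame(rng.exponential(1, size=(200, 2)), columns=['feature_1', 'feature_2'])
true_betas = np.array([1.5, 0.8])
true_decays = np.array([0.6, 0.3])
y_all = pd.Series((X_all.values * true_decays * true_betas).sum(axis=-1)
                  + rng.normal(0, 0.5, size=200), name='y')

X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2)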

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.kdeplot(y_test.values, ax=ax, color='k', lw=2, zorder=10000)
ppd_test = idata.predictions.y_hat.stack(sample=['chain', 'draw']).values
ppd_test = idata.predictions.y_hat.stack(sample=['chain', 'draw']).values

for sample in ppd_test.T[:250]:
    sns.kdeplot(sample, ax=ax, color='tab:blue', alpha=0.15)
ax.set_title('Out of Sample PPC')
plt.show()

[image: out-of-sample PPC, KDE of y_test over 250 posterior-predictive sample KDEs]

Another application is to sample the space of observed data more densely and use it to draw nice HDIs. Here I vary each feature in turn over a dense grid while holding the other feature at its mean value, use set_data to sample from the model, and plot the resulting HDI along with the training and test data.

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(14,4), sharey=True)

for i, axis in enumerate(fig.axes):
    x = X_train.values[:, i]
    axis.scatter(x, y_train, label='Training')
    axis.scatter(X_test.values[:, i], y_test.values, color='tab:orange', label='Test')

    x_mean = np.ones((1000, 2)) * X_train.values.mean(axis=0)
    x_grid = np.linspace(x.min(), x.max(), 1000)
    x_mean[:, i] = x_grid

    with basic_model:
        pm.set_data({'X':x_mean})
        temp_idata = pm.sample_posterior_predictive(idata)
        hdi = az.hdi(temp_idata.posterior_predictive, hdi_prob=0.69).y_hat

    axis.fill_between(x_grid, *hdi.values.T, alpha=0.25, color='tab:blue', label = '69% HDI')
    axis.set(xlabel=f'Feature {i}')

ax[0].legend()
plt.show()

Note that I didn’t need to supply a value of y when I did pm.set_data in this code block. That’s because I passed shape=X.shape[0] when I built the model, so PyMC is smart enough to dynamically resize y_hat. If you omitted that, you would have to pass in placeholder values of the correct size for y.
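
For completeness, that alternative would look something like this (a sketch):

with basic_model:
    # dummy y values of the right length; they are ignored when predicting
    pm.set_data({'X': X_test, 'y': np.zeros(len(X_test))})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)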

Hi Jesse! Thank you very much, really useful. It works!
Your answer raises two different topics:

  • in the prediction part of the code, is there any chance to get the predicted values of my features (not only y_hat)? I am looking at the idata object but I find only y_hat

  • as you mentioned, this model is quite simple, but if I wanted to incorporate different functions, what would you suggest?

Please find below a more complex model where I would like to apply predictions:

def geometric_adstock_tt(x_t, alpha=0, L=21, normalize=True):
    # geometrically decaying weights: alpha**0, alpha**1, ..., alpha**(L-1)
    w = tt.as_tensor_variable(
        [tt.power(alpha, i) for i in range(L)]
    )

    # stack L lagged copies of x_t, zero-padding the front of each lag
    xx = tt.stack(
        [tt.concatenate([
            tt.zeros(i),
            x_t[:x_t.shape[0] - i]
        ]) for i in range(L)]
    )

    # weighted sum over the lags, optionally normalizing the weights to sum to 1
    if not normalize:
        y = tt.dot(w, xx)
    else:
        y = tt.dot(w / tt.sum(w), xx)

    return y
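
For reference, this function spreads each value of x_t over the following L periods with geometrically decaying weights. A quick NumPy sanity check of the same computation (toy numbers, just for intuition):

import numpy as np

x = np.array([100.0, 0.0, 0.0, 0.0])  # a single spend burst at t=0
alpha, L = 0.5, 4
w = alpha ** np.arange(L)
w = w / w.sum()  # normalized weights: [0.533, 0.267, 0.133, 0.067]

adstocked = np.array([
    sum(w[i] * x[t - i] for i in range(L) if t - i >= 0)
    for t in range(len(x))
])
print(adstocked)  # [53.33, 26.67, 13.33, 6.67]: the burst decays over time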


basic_model = pm.Model()

with basic_model:
    response_mean = []
    for feature in features:
        xx = df_new[feature].values
        beta = pm.HalfNormal(f'beta_{feature}', sigma=2)
        decay = pm.Beta(f'decay_{feature}', alpha=3, beta=3)
        contribution = pm.Deterministic(f'contribution_{feature}', geometric_adstock_tt(xx, decay) * beta)
        response_mean.append(contribution)

    if control_vars:
        for control in control_vars:
            x = df_new[control].values
            print(f'Adding control: {control}')
            control_beta = pm.Normal(f'coeff_{control}', sigma=2)
            control_contribution = pm.Deterministic(f'contribution_{control}', control_beta * x)
            response_mean.append(control_contribution)

    sigma = pm.HalfNormal('sigma', sigma=1)

    likelihood = pm.Normal("likelihood", mu=sum(response_mean), sigma=sigma, observed=target)

Do you still suggest not looping over features, even though I now have a tensor and different calculations for different types of features?

Any suggestion is really appreciated.
Thank you!

In the end it just comes down to preference, I suppose. You can still iterate over the columns of a MutableData object using range. I’d split df into features and controls before modeling, make each its own data container, then do the looping as follows (warning: untested code):

import pytensor.tensor as pt

df_features = df[feature_vars]
n_obs, n_features = df_features.shape
n_controls = 0
coords = {'features': feature_vars,
          'all_vars': feature_vars}

if control_vars:
    df_controls = df[control_vars]
    _, n_controls = df_controls.shape
    coords.update({'controls': control_vars,
                   'all_vars': feature_vars + control_vars})


with pm.Model(coords=coords) as basic_model:
    X = pm.MutableData('feature_data', df_features.values)
    y = pm.MutableData('targets', df.target.values)

    n_obs = X.shape[0]
    
    betas = pm.HalfNormal('beta', sigma=2, dims=['features'])
    decays = pm.Beta('decay', alpha=3, beta=3, dims=['features'])

    contributions = pt.zeros((n_obs, n_features + n_controls))
    
    for i in range(n_features):
        x = geometric_adstock_tt(X[:, i], decays[i]) * betas[i]
        contributions = pt.set_subtensor(contributions[:, i], x)

    if n_controls > 0:
        Z = pm.MutableData('control_data', df_controls.values)
        control_betas = pm.Normal('control_beta', sigma=2, dims=['controls'])
        pt.set_subtensor(contributions[:, n_features:], Z @ control_betas)
    
    pm.Deterministic("contributions", contributions)
     
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    y_hat=  pm.Normal("y_hat", mu=contributions.sum(axis=-1), sigma=sigma, observed=y, shape=X.shape[0])
     idata = pm.sample(idata_kwargs={'dims':{'contributions':[None, 'all_vars']}})

It’s basically the same as yours but I use the data containers. I also wanted to use coords to tag the different dimensions instead of f-strings – this is helpful when plotting with arviz, because it makes it easy to ask for a subset of coordinates. One wrinkle here was that I don’t know a way to only label a subset of the dimensions – for contributions, we want to label only the 2nd dimension (since the 1st dimension is a batch dim). As a result, I had to include the idata_kwargs argument in pm.sample to get the right coords on contributions.
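
For example, with coords in place you can ask ArviZ for a subset of labels directly (a hypothetical call; 'feature_1' stands in for one of your actual feature names, and any plotting function accepting coords works the same way):

import arviz as az

# plot traces for a single feature only, selected by label rather than index
az.plot_trace(idata, var_names=['beta', 'decay'], coords={'features': ['feature_1']})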

Again, this is all just a matter of preference. Another way to do it would be to just include pm.MutableData in your loops, and use an f-string to name each mutable data object. The downside of this approach is that you will need to include each variable individually when you call pm.set_data, though admittedly you could write a function to ease that pain point.
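
Such a helper might look like this (a sketch, assuming each container was named data_{feature}):

def set_all_feature_data(model, new_data):
    """Update every per-feature data container at once.
    Assumes containers were created as pm.MutableData(f'data_{feature}', ...)."""
    with model:
        pm.set_data({f'data_{name}': values for name, values in new_data.items()})

# usage: set_all_feature_data(basic_model, {'feature_1': x1_test, 'feature_2': x2_test})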

I also didn’t look very carefully at the geometric_adstock_tt function, but I’m sure you could vectorize it with a bit of work. That would be my personal approach, but yet again, I emphasize that these are mostly style choices.
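
An untested sketch of what a vectorized version could look like, applying the same lag-stacking trick to all columns at once, with one alpha per channel:

def geometric_adstock_vec(X, alpha, L=21, normalize=True):
    # X: (n_obs, n_channels) tensor, alpha: (n_channels,) tensor
    w = pt.stack([alpha ** i for i in range(L)])  # (L, n_channels)
    if normalize:
        w = w / w.sum(axis=0)

    # (L, n_obs, n_channels): the i-th slice is X lagged by i steps, zero-padded
    lagged = pt.stack([
        pt.concatenate([pt.zeros((i, X.shape[1])), X[:X.shape[0] - i]], axis=0)
        for i in range(L)
    ])

    return (w[:, None, :] * lagged).sum(axis=0)  # (n_obs, n_channels)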

Also, have you checked out pymc-marketing? It’s a new project that is focusing on mixed-media models. It might be useful in your case.

Hi Jesse, sorry for the late reply, and first of all thank you for your help.
Unfortunately, I tried applying your code to my case but didn’t get satisfactory results. I reduced the number of features and controls, but the two models yield extremely different results. I’ll write both codes below and also attach the sample database.
PS: I will definitely read up on the pymc-marketing package.

My code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import arviz as az
import pymc as pm
import pytensor.tensor as tt

print(f"Running on PyMC v{pm.__version__}")


#decay
def geometric_adstock_tt(x_t, alpha=0, L=21, normalize=True):
    w = tt.as_tensor_variable(
        [tt.power(alpha, i) for i in range(L)]
    )
    
    xx = tt.stack(
        [tt.concatenate([
            tt.zeros(i),
            x_t[:x_t.shape[0]-i]
        ]) for i in range(L)]
    )
    
    if not normalize:
        y = tt.dot(w, xx)
    else:
        y = tt.dot(w / tt.sum(w), xx)

    return y


# logistic saturation
def logistic_function(x_t, mu=0.1):
    return (1 - tt.exp(-mu * x_t)) / (1 + tt.exp(-mu * x_t))




RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
az.style.use("arviz-darkgrid")



# import data
df = pd.read_csv('scaled.csv', sep=';')
df['Period'] = pd.to_datetime(df['Period'])
df = df.set_index('Period')

features = ['feature1', 'feature2']
control_vars = ['control1'] 
target = df['y'].to_numpy()

# bayes model
basic_model = pm.Model()

with basic_model:
    response_mean = []
    
    for feature in features:
        xx = df[feature].values
        print(f'Adding media: {feature}')
        beta = pm.HalfNormal(f'beta_{feature}', sigma=2)
        decay = pm.Beta(f'decay_{feature}', alpha=3, beta=3)
        sat = pm.Gamma(f'sat_{feature}', alpha=3, beta=1)
        feature_contribution = pm.Deterministic(f'contribution_{feature}', logistic_function(geometric_adstock_tt(xx, decay), sat) * beta)
        response_mean.append(feature_contribution)

    if control_vars:
        for control in control_vars:
            x = df[control].values
            print(f'Adding control: {control}')
            control_beta = pm.Normal(f'coeff_{control}', sigma=2)
            control_contribution = pm.Deterministic(f'contribution_{control}', control_beta * x)
            response_mean.append(control_contribution)

    sigma = pm.HalfNormal('sigma', sigma=1)
    likelihood = pm.Normal("likelihood", mu=sum(response_mean), sigma=sigma, observed=target)
        

with basic_model:
    # draw 1000 posterior samples
    idata = pm.sample(return_inferencedata=True, tune=1000)
    

with basic_model:
    post = pm.sample_posterior_predictive(idata)


arr = post.posterior_predictive['likelihood']
mean_y = np.mean(arr, axis=(0, 1))
mean_y = mean_y.values

Below is your code, adjusted by me:


import pandas as pd
import numpy as np
import arviz as az
import pymc as pm
import pytensor.tensor as tt

print(f"Running on PyMC v{pm.__version__}")


#decay
def geometric_adstock_tt(x_t, alpha=0, L=21, normalize=True):
    w = tt.as_tensor_variable(
        [tt.power(alpha, i) for i in range(L)]
    )
    
    xx = tt.stack(
        [tt.concatenate([
            tt.zeros(i),
            x_t[:x_t.shape[0]-i]
        ]) for i in range(L)]
    )
    
    if not normalize:
        y = tt.dot(w, xx)
    else:
        y = tt.dot(w / tt.sum(w), xx)

    return y


# logistic saturation
def logistic_function(x_t, mu=0.1):
    return (1 - tt.exp(-mu * x_t)) / (1 + tt.exp(-mu * x_t))




RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
az.style.use("arviz-darkgrid")



# import data
df = pd.read_csv('scaled.csv', sep=';')
# data attached: [scaled.csv|attachment](upload://aVzn3kWmmO8eE3bhm2o6vEOynI7.csv) (10.1 KB)

df['Period'] = pd.to_datetime(df['Period'])
df = df.set_index('Period')

features = ['feature1', 'feature2']
control_vars = ['control1'] 
target = df['y'].to_numpy()
df_features  = df[features]
n_obs, n_features = df_features.shape
coords = {'features':features, 
          'all_vars':features}


if control_vars:
    df_controls  = df[control_vars]
    _, n_controls = df_controls.shape
    coords.update({'controls':control_vars,
                   'all_vars':features + control_vars})

with pm.Model(coords=coords) as basic_model:
    X = pm.MutableData('feature_data', df_features.values)
    y = pm.MutableData('targets', df['y'].values)

    n_obs = X.shape[0]
    
    betas = pm.HalfNormal('beta', sigma=2, dims=['features'])
    decays = pm.Beta('decay', alpha=3, beta=3, dims=['features'])
    sat = pm.Gamma('sat', alpha=3, beta=1, dims=['features'])
    contributions = tt.zeros((n_obs, n_features + n_controls))

    for i in range(n_features):
        x = logistic_function(geometric_adstock_tt(X[:, i], decays[i]), sat[i]) * betas[i]
        contributions = tt.set_subtensor(contributions[:, i], x)
   
    if n_controls > 0:
        Z = pm.MutableData('control_data', df_controls.values)
        control_betas = pm.Normal('control_beta', sigma=2, dims=['controls'])
        tt.set_subtensor(contributions[:, n_features:], Z @ control_betas)
    
    pm.Deterministic("contributions", contributions)
     
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    y_hat=  pm.Normal("y_hat", mu=contributions.sum(axis=-1), sigma=sigma, observed=y, shape=X.shape[0])
    

with basic_model:
    # draw 1000 posterior samples
    idata = pm.sample(idata_kwargs={'dims':{'contributions':[None, 'all_vars']}})
    

with basic_model:
    post = pm.sample_posterior_predictive(idata)
    
arr = post.posterior_predictive['y_hat']
mean_y = np.mean(arr, axis=(0, 1))
mean_y = mean_y.values

Every suggestion is greatly appreciated.
All the best

When debugging models, it’s useful to use pm.model_to_graphviz to check for any obvious problems. In this case, the second model shows this graph:

[image: model graph, with control_data and control_beta disconnected from the rest]

As you can see, control_data and control_betas are not connected to anything. This is because the result of the tt.set_subtensor line is not being saved; it should read contributions = tt.set_subtensor(contributions[:, n_features:], Z @ control_betas)
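
To see why, note that set_subtensor is functional: it returns a new tensor rather than modifying anything in place, so the result has to be re-assigned:

import pytensor.tensor as tt

a = tt.zeros(3)
tt.set_subtensor(a[0], 1.0)      # result discarded: a is still all zeros
a = tt.set_subtensor(a[0], 1.0)  # correct: rebind the name to the new graph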

Looking at this graph and playing with it led me to rewrite the model slightly, by 1) assigning the deterministic to a variable, and 2) using a list for contributions and appending the elements to it, similarly to your original model. I think this is a bit easier to follow:

df_features  = df[features]
n_obs, n_features = df_features.shape
coords = {'time':df_features.index,
          'features':features, 
          'all_vars':features}

if control_vars:
    df_controls  = df[control_vars]
    _, n_controls = df_controls.shape
    coords.update({'controls':control_vars,
                   'all_vars':features + control_vars})

with pm.Model(coords=coords) as basic_model_2:
    X = pm.MutableData('feature_data', df_features.values, dims=['time', 'features'])
    y = pm.MutableData('targets', df['y'].values.squeeze(), dims=['time'])

    n_obs = X.shape[0]
    
    betas = pm.HalfNormal('beta', sigma=2, dims=['features'])
    decays = pm.Beta('decay', alpha=3, beta=3, dims=['features'])
    sat = pm.Gamma('sat', alpha=3, beta=1, dims=['features'])
    contributions = []
    
    for i in range(n_features):
        x = logistic_function(geometric_adstock_tt(X[:, i], decays[i]), sat[i]) * betas[i]
        contributions.append(x)

    if n_controls > 0:
        Z = pm.MutableData('control_data', df_controls.values, dims=['time', 'controls'])
        control_betas = pm.Normal('control_beta', sigma=2, dims=['controls'])
        contributions.append(Z @ control_betas)
    
    mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=['time', 'all_vars'])
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    y_hat = pm.Normal("y_hat", mu=mu.sum(axis=-1), sigma=sigma, observed=y, shape=X.shape[0], dims=['time'])

This generates the following graph:

[image: model graph with all nodes connected]

It should be equivalent to yours. Note that I set a coord on the time index, so you’ll have to pass an update for that when you do pm.set_data for the out-of-sample data; a sketch follows.
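
Something along these lines (a sketch; X_test and Z_test are hypothetical out-of-sample frames):

with basic_model_2:
    pm.set_data(
        {'feature_data': X_test.values, 'control_data': Z_test.values},
        coords={'time': X_test.index},  # new labels for the resized time dim
    )
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)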

Thank you, Jesse. After several weeks I managed to pick up the project again, and thanks to your guidance I understood where the error was in the model’s definition. Now it’s working properly, thanks! Unfortunately, I now find myself back in the initial situation: I keep getting an error during the out-of-sample phase.

Here is the graph:

Here is the model code:

features = ['feature1', 'feature2']
control_vars = ['seasonal','dummy'] 
target = df_new['y'].to_numpy()
df_features  = df_new[features]
n_obs, n_features = df_features.shape
coords = {'features':features, 
          'all_vars':features}

if control_vars:
    df_controls  = df_new[control_vars]
    _, n_controls = df_controls.shape
    coords.update({'controls':control_vars,
                   'all_vars':features + control_vars})

with pm.Model(coords=coords) as basic_model:
    X = pm.MutableData('feature_data', df_features.values)
    y = pm.MutableData('targets', df_new['y'].values.squeeze())

    n_obs = X.shape[0]
    
    betas = pm.HalfNormal('beta', sigma=2, dims=['features'])
    decays = pm.Beta('decay', alpha=3, beta=3, dims=['features'])
    sat = pm.Gamma('sat', alpha=3, beta=1, dims=['features'])
    contributions = []
    
    for i in range(n_features):
        x = logistic_function(geometric_adstock_tt(X[:, i], decays[i]), sat[i]) * betas[i]
        contributions.append(x)
        
    if n_controls > 0:
        Z = pm.MutableData('control_data', df_controls.values)
        control_betas = pm.Normal('control_beta', sigma=2, dims=['controls'])

        for w in range(n_controls):
            z = Z[:, w] * control_betas[w]
            contributions.append(z)
    
    mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=[None, 'all_vars'])
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    y_hat = pm.Normal("y_hat", mu=mu.sum(axis=-1), sigma=sigma, observed=y, shape=X.shape[0])
    


with basic_model:
    # draw 1000 posterior samples
    idata = pm.sample(idata_kwargs={'dims': {'contributions': [None, 'all_vars']}})
    
with basic_model:
    post = pm.sample_posterior_predictive(idata)

And it works!

Then the out of sample code:

x_test = df_x_test.values.astype(np.float64)
z_test = df_z_test.values.astype(np.float64)
y_test = test_target

with basic_model:
    pm.set_data({"X":x_test, "y":y_test, "Z":z_test})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)

but I get this error:

Traceback (most recent call last):

  File ~\Anaconda3\envs\pymc5_spyder\lib\site-packages\pymc\model.py:1602 in __getitem__
    return self.named_vars[self.name_for(key)]

KeyError: 'X'


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  Cell In[26], line 2
    pm.set_data({"X":x_test, "y":y_test, "z":z_test})

  File ~\Anaconda3\envs\pymc5_spyder\lib\site-packages\pymc\model.py:2048 in set_data
    model.set_data(variable_name, new_value, coords=coords)

  File ~\Anaconda3\envs\pymc5_spyder\lib\site-packages\pymc\model.py:1177 in set_data
    shared_object = self[name]

  File ~\Anaconda3\envs\pymc5_spyder\lib\site-packages\pymc\model.py:1604 in __getitem__
    raise e

  File ~\Anaconda3\envs\pymc5_spyder\lib\site-packages\pymc\model.py:1599 in __getitem__
    return self.named_vars[key]

KeyError: 'X'

The test objects have the same shapes as those used in the main model.
It seems like an error possibly related to labels, but I’m having trouble pinpointing the issue.

My ultimate goal is to obtain a dataframe with out-of-sample predictions not only for y_hat but also for the features and control variables.

Thank you in advance for your help!

The name you give to pm.set_data needs to match the variable name you set on the pm.MutableData, not the name of the Python object you provide. So in your case, you need to write something like:

    pm.set_data({"feature_data":x_test, "targets":y_test, "control_data":z_test})

Hi Jesse! Thank you.
Now the code is running correctly, and I can perform an out-of-sample run with this code:

with basic_model:
    pm.set_data({"feature_data": x_test, "control_data": z_test})
    idata.extend(pm.sample_posterior_predictive(idata, predictions=True))

y_hat_predicted = idata.predictions['y_hat'].mean(dim=["chain", "draw"])

However, the predictions group of the idata object only contains the estimated values of y_hat.
It would be fantastic to also have the values of the individual elements that compose it, in other words the prediction of each feature and each control.

Thank you!

You can use the var_names argument of pm.sample_posterior_predictive to get more variables; check the docs for more info. This post gives an overview of some common pitfalls with sample_posterior_predictive, and this blog post is the pm.sample_posterior_predictive masterclass.
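
For example, to also get the out-of-sample contributions (a sketch building on the model above; sampling a Deterministic through var_names recomputes it with the new data):

with basic_model:
    pm.set_data({"feature_data": x_test, "control_data": z_test})
    preds = pm.sample_posterior_predictive(
        idata, var_names=['y_hat', 'contributions'], predictions=True
    )

# average over chains and draws, then convert the (time, all_vars) array to a dataframe
contrib_df = preds.predictions['contributions'].mean(dim=['chain', 'draw']).to_pandas()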