Getting the same prediction when using the PyMC3 data container to generate Bayesian regression prediction using new data

Lita_Lertlumprasert · December 9, 2022, 12:33am

I built a Bayesian regression model with the PyMC3 package and am trying to generate predictions from new data. I used the data container pm.Data() to train the model on the training data, then passed the new data to pm.set_data() before calling pm.sample_posterior_predictive(). The predictions were what I would expect from the training data, not the new data.

Here’s my model:

df_train = df.drop(['Unnamed: 0', 'DATE_AT'], axis=1)

with Model() as model:
    response_mean = []
    x_ = pm.Data('features', df_train) # a data container, can be changed
    t = np.transpose(x_.get_value())
    
    # intercept
    y = Normal('y', mu=0, sigma=6000)
    response_mean.append(y)
    
    # channels that can have DECAY and SATURATION effects
    for channel_name in delay_channels:
        i = df_train.columns.get_loc(channel_name)
        xx = t[i].astype(float)
        
        print(f'Adding Delayed Channels: {channel_name}')
        c = coef.loc[coef['features']==channel_name, 'coef'].values[0]
        s = abs(c*0.015)
        if c <= 0:
            channel_b = HalfNormal(f'beta_{channel_name}', sd=s)
        else:
            channel_b = Normal(f'beta_{channel_name}', mu=c, sigma=s)
        
        alpha = Beta(f'alpha_{channel_name}', alpha=3, beta=3)
        channel_mu = Gamma(f'mu_{channel_name}', alpha=3, beta=1)
        response_mean.append(logistic_function(
            geometric_adstock_tt(xx, alpha), channel_mu) * channel_b)
    
    # channels that have SATURATION effects only
    for channel_name in non_lin_channels:
        i = df_train.columns.get_loc(channel_name)
        xx = t[i].astype(float)
        
        print(f'Adding Non-Linear Logistic Channel: {channel_name}')
        c = coef.loc[coef['features']==channel_name, 'coef'].values[0]
        s = abs(c*0.015)
        if c <= 0:
            channel_b = HalfNormal(f'beta_{channel_name}', sd=s)
        else:
            channel_b = Normal(f'beta_{channel_name}', mu=c, sigma=s)
        
        # logistic reach curve
        channel_mu = Gamma(f'mu_{channel_name}', alpha=3, beta=1)
        response_mean.append(logistic_function(xx, channel_mu) * channel_b)
        
    # continuous external features
    for channel_name in control_vars:
        i = df_train.columns.get_loc(channel_name)
        xx = t[i].astype(float)

        print(f'Adding control: {channel_name}')
        c = coef.loc[coef['features']==channel_name, 'coef'].values[0]
        s = abs(c*0.015)
        if c <= 0:
            control_beta = HalfNormal(f'beta_{channel_name}', sd=s)
        else:
            control_beta = Normal(f'beta_{channel_name}', mu=c, sigma=s)
            
        channel_contrib = control_beta * xx
        response_mean.append(channel_contrib)
        
    # categorical control variables
    for var_name in index_vars:
        i = df_train.columns.get_loc(var_name)
        shape = len(np.unique(t[i]))
        x = t[i].astype('int')
        
        print(f'Adding Index Variable: {var_name}')
        
        ind_beta = Normal(f'beta_{var_name}', sd=6000, shape=shape)
        channel_contrib = ind_beta[x]
        response_mean.append(channel_contrib)
        
    # noise
    sigma = Exponential('sigma', 10)

    
    # define likelihood
    likelihood = Normal(outcome, mu=sum(response_mean), sd=sigma, observed=df[outcome].values)
    
    trace = pm.sample(tune=3000, cores=4, init='advi')

Here are the betas from the model. Notice that ADWORDS_SEARCH is one of the most important features:

When I zeroed out the ADWORDS_SEARCH feature, I got practically identical predictions, which cannot be right:

with model:
    y_pred = sample_posterior_predictive(trace)
    
mod_channel = 'ADWORDS_SEARCH'
df_mod = df_train.copy(deep=True)
df_mod.iloc[12:-12, df_mod.columns.get_loc(mod_channel)] = 0

with model:
    pm.set_data({'features':df_mod})
    y_pred_mod = pm.sample_posterior_predictive(trace)

By zeroing out ADWORDS_SEARCH, I would expect the predictions to be significantly lower than the original ones, since ADWORDS_SEARCH is one of the most important features according to the betas.

I started questioning the model, but it seems to perform well:

MAPE = 6.3%
r2 = 0.7

I also tried passing the original training data set to pm.set_data() and got very similar results.

This is the difference between the predictions from the training data and from the new data:

This is the difference between the predictions from the training data and from the same training data passed through pm.set_data():

Anyone know what I’m doing wrong?

cluhmann · December 9, 2022, 12:56am

Welcome!

Maybe it’s this? From the API docs:

Since v4.1.0 the default value is mutable=False, with previous versions having mutable=True.

In general, I think the idiomatic approach is to use either pm.MutableData or pm.ConstantData to avoid the ambiguity associated with pm.Data.

Lita_Lertlumprasert · December 9, 2022, 11:51pm

Thank you for getting back! I tried passing the mutable=True argument, but got this error. I’m using the newest version of PyMC3 (3.11.5), so I’m not sure why I got this error. pm.MutableData() also didn’t work.

cluhmann · December 10, 2022, 7:48am

I would strongly recommend upgrading to version 4 if possible (current version is 4.4). Installation instructions are here.
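
Separately from the version question, one detail in the posted model would produce this symptom on its own: `t = np.transpose(x_.get_value())` copies the container's contents into a NumPy array while the model is being built, and every later term is computed from that copy. The model therefore never refers to the `features` container, so `pm.set_data()` has nothing to update. The following is a toy illustration in plain Python of that snapshot-versus-live-reference behaviour; `ToyContainer` and both builder functions are hypothetical stand-ins invented for this sketch, not PyMC3 or Theano APIs.

```python
# Toy stand-in for a shared data container. NOT the PyMC3/Theano API;
# every name here is hypothetical and exists only for illustration.
class ToyContainer:
    def __init__(self, values):
        self._values = list(values)

    def get_value(self):
        # Return a copy of the current contents.
        return list(self._values)

    def set_value(self, values):
        # Swap in new contents, loosely analogous to pm.set_data().
        self._values = list(values)


def build_from_snapshot(container, beta):
    # Reads the data once, at "model build" time, and keeps that copy.
    snapshot = container.get_value()
    return lambda: [beta * x for x in snapshot]


def build_from_container(container, beta):
    # Reads the data each time a prediction is requested.
    return lambda: [beta * x for x in container.get_value()]


features = ToyContainer([1.0, 2.0, 3.0])
frozen_pred = build_from_snapshot(features, beta=2.0)
live_pred = build_from_container(features, beta=2.0)

# Zero out the feature after the "model" has been built.
features.set_value([0.0, 0.0, 0.0])

print(frozen_pred())  # still computed from the original values
print(live_pred())    # computed from the new values
```

In the real model, the analogous change would be to build each term from `x_` itself (for example by indexing the symbolic variable) instead of from `x_.get_value()`. Whether `geometric_adstock_tt` and `logistic_function` accept symbolic inputs is not shown in the thread, so that part is an assumption to check.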
