PyMC v5.10.3 prediction stuff

Hello everyone.
I have been studying pymc version 5 for a few weeks now. My goal is to build a simple linear regression model and make out-of-sample predictions.
Reading the documentation and other community posts, I have written this short and simplified code.

with pm.Model(coords=coords) as model:
    y = pm.MutableData('targets', df['y'].values.squeeze())

    contributions = []

    Z = pm.MutableData('control_data', df_controls.values)
    control_betas = pm.Normal('control_beta', sigma = 2, dims=['controls'])
        
    for w in range(n_controls):
	z = Z[:,w]*control_betas[w]
	contributions.append(z)
               
    mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=['all_vars'])
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    y_hat = pm.Normal("y_hat", mu=mu.sum(axis=-1), sigma=sigma, observed=target, shape=Z.shape[0])


with model:
    data = pm.sample(idata_kwargs={'dims':{'contributions':[None, 'all_vars']}})

z_test = df_x_test.values

with model:
    pm.set_data({"control_data":z_test})
    idata.extend(pm.sample_posterior_predictive(idata))


y_hat_predicted = idata.posterior_predictive['y_hat'].mean(dim=["chain", "draw"])

The code works correctly, I get the new out-of-sample y, but I would also like to have predictions of the contributions of the individual variables, not just the y.
Is this possible?
I tried looking in the idata object but without success.
Can someone kindly point out where I am going wrong in the code or what code I need to insert to also get these predictions?
Thanks in advance to anyone who can help.

It sounds like you are looking for the posterior distribution of your control_beta parameters? You can view their estimates by calling az.summary(idata, var_names = “control_beta”), and you can also get their estimates by using idata[“posterior”].mean(dim=["chain", "draw"])

1 Like

You can specify var_names=["mu", "y_hat"] to get both y and the deterministic in sample_posterior_predictive

@teoretico out of curiosity what resources are you using to study PyMC?

Thank you, Riccardo. I tried to correct the code like this:

with model:
    pm.set_data({"control_data":z_test})
    idata.extend(pm.sample_posterior_predictive(idata,  var_names=["y_hat", "mu"]))

but I obtained the following error:

KeyError: ‘mu’

If I leave only y_hat, the code runs normally and I am able to extract the values, but if I add ‘mu’, it goes into error. ‘Mu’ is correctly present in the model above:

mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=['all_vars'])

How can I solve this issue?
Thank you!

Hi Riccardo.
I use the official documentation
https://www.pymc.io/
and I also watch several YT videos (for instance: https://www.youtube.com/watch?v=UznM_-_760Y) or web articles (for instance: https://towardsdatascience.com/bayesian-marketing-mix-modeling-in-python-via-pymc3-7b2071f6001a)
I also looked at and tested PyMC marketing. It’s very useful, but I prefer to maintain maximum flexibility by customizing my models. I hope I have been helpful.

1 Like

Sorry I read the name of the variable and not the Deterministic, usually I give the same name to avoid confusion. In you case instead of "mu", you want "contributions" (the name of the Deterministic) in var_names

Thanks for sharing. I see you are focused in MMM. I was asking because the way you wrote your model seemed a bit unusual for simpler models, but may make sense for extended MMM

Hi Ricardo. Thank you for your answer
Now it seems there’s a shape issue…
…maybe I’m getting lost in a glass of water…
I’ll present the entire code to you below.

control_vars = ['c1','c2', 'c3', 'c4', 'c5' ] 
target = df_scaled['y'].to_numpy()
df_controls  = df_scaled[control_vars]
n_obs, n_controls = df_controls.shape

coords = {'controls':control_vars, 
          'all_vars':control_vars}

with pm.Model(coords=coords) as model:
    X = pm.MutableData('control_data', df_controls.values)
    y = pm.MutableData('targets', df_scaled['y'].values.squeeze())

    n_obs = X.shape[0]
    
    contributions = []
    
    control_betas = pm.Normal('control_beta', sigma = 2, dims=['controls'])
    
    for w in range(n_controls):
        x = X[:,w]*control_betas[w]
        contributions.append(x)
         
    mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=['controls'])
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    y_hat = pm.Normal("y_hat", mu=mu.sum(axis=-1), sigma=sigma, observed=target, shape=X.shape[0])
    
with model:
    idata = pm.sample(idata_kwargs={'dims':{'contributions':[None, 'controls']}})
    

And up to this point, everything is fine, then there’s the out-of-sample part.

x_test = df_test_controls.values
with model:
    pm.set_data({"control_data":x_test})
    idata.extend(pm.sample_posterior_predictive(idata, var_names=["y_hat", "contributions"])) 

But I obtain the following error:

ValueError: conflicting sizes for dimension ‘controls’: length 110 on the data but length 5 on coordinate ‘controls’

the shapes are:
df_test_controls.values = Array of float64 (110,5)
df_controls.values = Array of float64 (110,5)
coords = Dict (5)

It seems there is a mismatch in dimensions. Perhaps a naming issue?
Thank you

PyMC dimensions are just labels for the output of sampling. All broadcasting logic is positionally-like numpy so you need to make sure the variables are aligned correctly like you would do in numpy

Hello Ricardo. Thank you for the suggestions. Actually, the variables are aligned correctly. After several attempts, I run the code without errors modifying this line in the model definition:

before:

mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=['all_vars'])

after:

mu = pm.Deterministic("contributions", tt.stack(contributions).T)

That said, I have another element to investigate. Please tell me if I need to open another thread for this.
If I create an out-of-sample loop to sequentially test different values of the model variables with this code:

with model:
        pm.set_data({ "control_data": z_test})
        idata.extend(pm.sample_posterior_predictive(idata, var_names=["y_hat", "contributions"])) 

the output does not change despite changing the input. To make it change, I have to redefine the model from the beginning, run it, and then do the out-of-sample. This is feasible but very time-consuming. Is there any way or command to perform this iteration without starting from scratch? Thank you very much.

Can you share the full code you are using?