PyMC v5.10.3 prediction stuff

teoretico · January 7, 2024, 11:17am

Hello everyone.
I have been studying pymc version 5 for a few weeks now. My goal is to build a simple linear regression model and make out-of-sample predictions.
Reading the documentation and other community posts, I have written this short and simplified code.

with pm.Model(coords=coords) as model:
    y = pm.MutableData('targets', df['y'].values.squeeze())

    contributions = []

    Z = pm.MutableData('control_data', df_controls.values)
    control_betas = pm.Normal('control_beta', sigma = 2, dims=['controls'])
        
    for w in range(n_controls):
	z = Z[:,w]*control_betas[w]
	contributions.append(z)
               
    mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=['all_vars'])
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    y_hat = pm.Normal("y_hat", mu=mu.sum(axis=-1), sigma=sigma, observed=target, shape=Z.shape[0])


with model:
    data = pm.sample(idata_kwargs={'dims':{'contributions':[None, 'all_vars']}})

z_test = df_x_test.values

with model:
    pm.set_data({"control_data":z_test})
    idata.extend(pm.sample_posterior_predictive(idata))


y_hat_predicted = idata.posterior_predictive['y_hat'].mean(dim=["chain", "draw"])

The code works correctly, I get the new out-of-sample y, but I would also like to have predictions of the contributions of the individual variables, not just the y.
Is this possible?
I tried looking in the idata object but without success.
Can someone kindly point out where I am going wrong in the code or what code I need to insert to also get these predictions?
Thanks in advance to anyone who can help.

yahrMason · January 7, 2024, 1:06pm

It sounds like you are looking for the posterior distribution of your control_beta parameters? You can view their estimates by calling az.summary(idata, var_names = “control_beta”), and you can also get their estimates by using idata[“posterior”].mean(dim=["chain", "draw"])

ricardoV94 · January 7, 2024, 1:59pm

You can specify var_names=["mu", "y_hat"] to get both y and the deterministic in sample_posterior_predictive

ricardoV94 · January 7, 2024, 2:03pm

@teoretico out of curiosity what resources are you using to study PyMC?

teoretico · January 7, 2024, 6:36pm

Thank you, Riccardo. I tried to correct the code like this:

with model:
    pm.set_data({"control_data":z_test})
    idata.extend(pm.sample_posterior_predictive(idata,  var_names=["y_hat", "mu"]))

but I obtained the following error:

KeyError: ‘mu’

If I leave only y_hat, the code runs normally and I am able to extract the values, but if I add ‘mu’, it goes into error. ‘Mu’ is correctly present in the model above:

mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=['all_vars'])

How can I solve this issue?
Thank you!

teoretico · January 7, 2024, 6:41pm

Hi Riccardo.
I use the official documentation
https://www.pymc.io/
and I also watch several YT videos (for instance: https://www.youtube.com/watch?v=UznM_-_760Y) or web articles (for instance: https://towardsdatascience.com/bayesian-marketing-mix-modeling-in-python-via-pymc3-7b2071f6001a)
I also looked at and tested PyMC marketing. It’s very useful, but I prefer to maintain maximum flexibility by customizing my models. I hope I have been helpful.

ricardoV94 · January 7, 2024, 9:20pm

Sorry I read the name of the variable and not the Deterministic, usually I give the same name to avoid confusion. In you case instead of "mu", you want "contributions" (the name of the Deterministic) in var_names

ricardoV94 · January 7, 2024, 9:23pm

Thanks for sharing. I see you are focused in MMM. I was asking because the way you wrote your model seemed a bit unusual for simpler models, but may make sense for extended MMM

teoretico · January 8, 2024, 12:13pm

Hi Ricardo. Thank you for your answer
Now it seems there’s a shape issue…
…maybe I’m getting lost in a glass of water…
I’ll present the entire code to you below.

control_vars = ['c1','c2', 'c3', 'c4', 'c5' ] 
target = df_scaled['y'].to_numpy()
df_controls  = df_scaled[control_vars]
n_obs, n_controls = df_controls.shape

coords = {'controls':control_vars, 
          'all_vars':control_vars}

with pm.Model(coords=coords) as model:
    X = pm.MutableData('control_data', df_controls.values)
    y = pm.MutableData('targets', df_scaled['y'].values.squeeze())

    n_obs = X.shape[0]
    
    contributions = []
    
    control_betas = pm.Normal('control_beta', sigma = 2, dims=['controls'])
    
    for w in range(n_controls):
        x = X[:,w]*control_betas[w]
        contributions.append(x)
         
    mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=['controls'])
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    y_hat = pm.Normal("y_hat", mu=mu.sum(axis=-1), sigma=sigma, observed=target, shape=X.shape[0])
    
with model:
    idata = pm.sample(idata_kwargs={'dims':{'contributions':[None, 'controls']}})

And up to this point, everything is fine, then there’s the out-of-sample part.

x_test = df_test_controls.values
with model:
    pm.set_data({"control_data":x_test})
    idata.extend(pm.sample_posterior_predictive(idata, var_names=["y_hat", "contributions"]))

But I obtain the following error:

ValueError: conflicting sizes for dimension ‘controls’: length 110 on the data but length 5 on coordinate ‘controls’

the shapes are:
df_test_controls.values = Array of float64 (110,5)
df_controls.values = Array of float64 (110,5)
coords = Dict (5)

It seems there is a mismatch in dimensions. Perhaps a naming issue?
Thank you

ricardoV94 · January 8, 2024, 12:17pm

PyMC dimensions are just labels for the output of sampling. All broadcasting logic is positionally-like numpy so you need to make sure the variables are aligned correctly like you would do in numpy

teoretico · January 12, 2024, 8:38am

Hello Ricardo. Thank you for the suggestions. Actually, the variables are aligned correctly. After several attempts, I run the code without errors modifying this line in the model definition:

before:

mu = pm.Deterministic("contributions", tt.stack(contributions).T, dims=['all_vars'])

after:

mu = pm.Deterministic("contributions", tt.stack(contributions).T)

That said, I have another element to investigate. Please tell me if I need to open another thread for this.
If I create an out-of-sample loop to sequentially test different values of the model variables with this code:

with model:
        pm.set_data({ "control_data": z_test})
        idata.extend(pm.sample_posterior_predictive(idata, var_names=["y_hat", "contributions"]))

the output does not change despite changing the input. To make it change, I have to redefine the model from the beginning, run it, and then do the out-of-sample. This is feasible but very time-consuming. Is there any way or command to perform this iteration without starting from scratch? Thank you very much.

ricardoV94 · January 15, 2024, 5:03pm

Can you share the full code you are using?

Topic		Replies	Views
Pymc5 out of sample v5 modeling	9	1228	August 20, 2023
Out of Sample Predictions on heirarchical regression model (PyMC - v5.0.1) v5	5	586	March 2, 2023
Out of sample prediction v5 modeling	20	1879	April 7, 2023
How to properly do out-of-sample prediction for hierarchical model v5 modeling , hierarchical , prediction	24	1387	January 25, 2024
How to include coords in out-of-sample prediction? v5 modeling	1	102	July 2, 2024

PyMC v5.10.3 prediction stuff

Related topics