Centering data using pm.Deterministic

What is the ‘pymc’-onic method to express centering in a model to make out-of-sample prediction easier? Right now I am doing something like:

mean_y = pm.Data("mean_y", df["y"].mean())
y = pm.Data("y",  df["y"].to_numpy()) )  
y_c = pm.Deterministic("y_c", y - mean_y)

The idea is that when you replace y (set_data) you keep the old mean that was used to ‘fit’ the model. One issue with this is that the idata will now carry around both y and y_c. The alternative is to do this with some function that transforms the data before setting it in the model.
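Something like this, where new_df stands in for the out-of-sample data:

import numpy as np
import pymc as pm

# Capture the fit-time statistic once, outside the model
train_mean = df["y"].mean()

def center(values, mean=train_mean):
    # Always center with the mean captured at fit time
    return np.asarray(values) - mean

with pm.Model() as model:
    y_c = pm.Data("y_c", center(df["y"]))

# Later, for out-of-sample prediction:
with model:
    pm.set_data({"y_c": center(new_df["y"])})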

What is the best practice?

I am not sure if I am understanding your issue, but I guess you are trying to recenter the data within the model according to the new data? BTW, you can still update the mean in your current approach by just using pm.set_data for both the data and the mean value.
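For example, something like this, where new_y stands in for the replacement array:

with model:
    pm.set_data({"y": new_y, "mean_y": new_y.mean()})

Also, you don’t need to wrap the centered value in a Deterministic if you don’t need to track it.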

This works without the Deterministic:

y = pm.Data("y",  df["y"].to_numpy())  
y_c = y -  df["y"].mean()

If you want to center based on the mean of the new data then you can just do this:

y = pm.Data("y",  df["y"].to_numpy())  
y_c = y - y.mean()

Is it important for your case to recenter the data to the new mean? In most cases I think it’s best practice to center based on your original mean for out-of-sample prediction, unless you are specifically trying to do out-of-model prediction and just reuse the parameters.
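For reference, here’s a minimal sketch of that workflow; the model, data, and names are hypothetical, with the centered variable used as a predictor:

import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
x_train = rng.normal(size=50)
y_train = 2.0 * (x_train - x_train.mean()) + rng.normal(scale=0.5, size=50)

with pm.Model() as model:
    x = pm.Data("x", x_train)
    x_c = x - x_train.mean()  # fit-time mean enters the graph as a plain constant
    intercept = pm.Normal("intercept")
    beta = pm.Normal("beta")
    pm.Normal("obs", mu=intercept + beta * x_c, sigma=0.5,
              observed=y_train, shape=x.shape)
    idata = pm.sample()

with model:
    # New covariates are still centered with the training mean,
    # because that mean is a constant in the graph
    pm.set_data({"x": rng.normal(size=20)})
    pm.sample_posterior_predictive(idata, predictions=True)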


Right, my intent is to NOT recenter on the new mean, so when I do set_data I would leave that one alone. But now that I see what you wrote, I realize I was making it more complicated than it needs to be! Since I don’t intend to ever call set_data with ‘mean_y’, I can just write it as in your first example. Doh!

You can also use .eval() to evaluate a sub-graph and get back a number. This will turn it into a constant if you use it in further symbolic computation. So for example:

y = pm.Data('y', df['y'])
y_centered = y - y.mean().eval()

@jessegrabowski So does this mean that y.mean().eval() would be held constant in any future out-of-sample/out-of-model prediction, and that the operation y_centered = y - y.mean().eval() would not get re-evaluated when setting new data with pm.set_data({'y': [....]})?

Thanks Jesse! Love this community; I just listened to you on LBS and here you are answering questions!


y.mean().eval() will compute the value and put it in the graph as a constant, so it won’t be affected by future set_data. This little example helped clarify things for me…

import pymc as pm
import numpy as np
import pytensor

x = np.random.randn(5)

with pm.Model() as model:
    xdata = pm.Data("x_data", x)
    x_centered = xdata - xdata.mean()  # symbolic mean: recomputed when the data changes

print("xdata - xdata.mean()")
pytensor.dprint(x_centered)

with pm.Model() as model:
    xdata = pm.Data("x_data", x)
    x_centered = xdata - xdata.mean().eval()  # eval() bakes the mean in as a constant

print('-' * 20)
print("xdata - xdata.mean().eval()")
pytensor.dprint(x_centered)

with model:
    pm.set_data({"x_data": np.zeros_like(x)})
print('-' * 20)
print("After pm.set_data")
pytensor.dprint(x_centered)

To be clear, it’s totally equivalent to your suggested approach (computing the mean outside of the graph, using the raw data). It might be a tiny bit more readable, because it’s happening “in the model”. On the other hand, it requires you to understand what pytensor is doing. So in the end, it’s down to preference : )

@DrEntropy Thanks for the kind words, and for the great example. It shows very clearly what’s happening!


Small tip: you can also do x_centered.dprint()
