Centering data using pm.Deterministic

What is the ‘pymc’-onic method to express centering in a model to make out-of-sample prediction easier? Right now I am doing something like:

mean_y = pm.Data("mean_y", df["y"].mean())
y = pm.Data("y",  df["y"].to_numpy()) )  
y_c = pm.Deterministic("y_c", y - mean_y)

The idea is that when you replace y (set_data) you keep the old mean that was used to ‘fit’ the model. One issue with this is that the idata will now carry around both y and y_c. The alternative is to do this with some function that transforms the data before setting it in the model.
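Something like this, where new_df stands in for the out-of-sample data:

import numpy as np
import pymc as pm

# Capture the fit-time statistic once, outside the model
train_mean = df["y"].mean()

def center(values, mean=train_mean):
    # Always center with the mean captured at fit time
    return np.asarray(values) - mean

with pm.Model() as model:
    y_c = pm.Data("y_c", center(df["y"]))

# Later, for out-of-sample prediction:
with model:
    pm.set_data({"y_c": center(new_df["y"])})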

What is the best practice?

I am not sure if I am understanding your issue, but I guess you are trying to recenter the data within the model according to the new data? BTW, you can still update the mean in your current approach by just using pm.set_data for both the data and the mean value.
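For example, something like this, where new_y stands in for the replacement array:

with model:
    pm.set_data({"y": new_y, "mean_y": new_y.mean()})

Also, you don’t need to wrap the centered value in a Deterministic if you don’t need to track it.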

This works without the Deterministic:

y = pm.Data("y",  df["y"].to_numpy())  
y_c = y -  df["y"].mean()

If you want to center based on the mean of the new data then you can just do this:

y = pm.Data("y",  df["y"].to_numpy())  
y_c = y - y.mean()

Is it important for your case to recenter the data to the new mean? In most cases I think it’s best practice to center based on your original mean for out-of-sample prediction, unless you are specifically trying to do out-of-model prediction and just reuse the parameters.
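For reference, here’s a minimal sketch of that workflow; the model, data, and names are hypothetical, with the centered variable used as a predictor:

import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
x_train = rng.normal(size=50)
y_train = 2.0 * (x_train - x_train.mean()) + rng.normal(scale=0.5, size=50)

with pm.Model() as model:
    x = pm.Data("x", x_train)
    x_c = x - x_train.mean()  # fit-time mean enters the graph as a plain constant
    intercept = pm.Normal("intercept")
    beta = pm.Normal("beta")
    pm.Normal("obs", mu=intercept + beta * x_c, sigma=0.5,
              observed=y_train, shape=x.shape)
    idata = pm.sample()

with model:
    # New covariates are still centered with the training mean,
    # because that mean is a constant in the graph
    pm.set_data({"x": rng.normal(size=20)})
    pm.sample_posterior_predictive(idata, predictions=True)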


Right, my intent is to NOT recenter on the new mean, so when I do set_data I would leave that one alone. But now that I see what you wrote, I realize I was making it more complicated than it needs to be! Since I don’t intend to ever call set_data with ‘mean_y’, I can just write it as in your first example. Doh!

You can also use .eval() to evaluate a sub-graph and get back a number. This will turn it into a constant if you use it in further symbolic computation. So for example:

y = pm.Data('y', df['y'])
y_centered = y - y.mean().eval()

@jessegrabowski So does this mean that y.mean().eval() would be held constant in any future out-of-sample/out-of-model prediction, and that the operation y_centered = y - y.mean().eval() would not get re-evaluated when setting new data with pm.set_data({'y': [....]})?

Thanks Jesse! Love this community; I just listened to you on LBS and here you are answering questions!


y.mean().eval() will compute the value and put it in the graph as a constant, so it won’t be affected by future set_data. This little example helped clarify things for me…

import pymc as pm
import numpy as np
import pytensor

x = np.random.randn(5)

with pm.Model() as model:
    xdata = pm.Data("x_data", x)
    x_centered = xdata - xdata.mean()  # symbolic mean: recomputed when the data changes

print("xdata - xdata.mean()")
pytensor.dprint(x_centered)

with pm.Model() as model:
    xdata = pm.Data("x_data", x)
    x_centered = xdata - xdata.mean().eval()  # eval() bakes the mean in as a constant

print('-' * 20)
print("xdata - xdata.mean().eval()")
pytensor.dprint(x_centered)

with model:
    pm.set_data({"x_data": np.zeros_like(x)})
print('-' * 20)
print("After pm.set_data")
pytensor.dprint(x_centered)

To be clear, it’s totally equivalent to your suggested approach (computing the mean outside of the graph, using the raw data). It might be a tiny bit more readable, because it’s happening “in the model”. On the other hand, it requires you to understand what pytensor is doing. So in the end, it’s down to preference : )

@DrEntropy Thanks for the kind words, and for the great example. It shows very clearly what’s happening!


Small tip: you can also do x_centered.dprint()
