Where to standardize data

mattiasthalen · October 9, 2020, 7:36am

If I want to standardize data, i.e. (values - values.mean())/values.std(), where should I do that, inside or outside the model?

I would love to be able to predict on new values and plot them using az.plot_posterior() and having the non-standardized estimations shown. Is the only way to use pm.Deterministic do calculate the non-standardized parameters and the non-standardized estimated value? If so, how do I plot that estimated value in az.plot_posterior()?

OriolAbril · October 9, 2020, 5:35pm

There are many alternatives, each with different strengths and focuses, personally I am not tied to any and I am not really sure what to recommend either so here goes a quick overview of the alternatives:

using pm.Deterministics
using only standarized values in the model and after inference/forward sampling destandarize the results of the inferencedata object.
Here you can create new variables on the dataset or directly modify the present variables, InferenceData objects are completely mutable. Moreover there are some helper functions such as idata.map to apply the same function to multiple groups such as observed_data, posterior_predictive and predictions at the same time…
like the above but using the transform argument of ArviZ plots.
use only destandardized data in the model and standardize within the model code (either storing the destandardized versions with deterministic or not, it’s probably not necessary in most cases). I guess that in most cases the standardization cost is negligible compared to the rest of the model

And focused on predictions/posterior_predictive sampling, there is also the option to create a custom sample posterior predictive with xarray/numpy/dask… This option is extremely flexible but generally requires more work than the other alternatives, so I would only recommend it if there is an extra need for this such as a bug in PyMC sample posterior predictive for a required distribution or having to predict a large amout of data using dask because it does not fit in memory…

mattiasthalen · October 15, 2020, 2:31pm

To spin off on standardization.
What throws me off is how to scale the prior for the error.
My unstandardized priors are

intercept = pm.Normal(100, 10)
slope = pm.Normal(-100, 10)
error = pm.HalfNormal(5)

My initial thought was to scale the error like this:

5 / y_obs.std()

But that result in estimations that are a bit too narrow. And 1 is waaaay to wide.

Topic		Replies	Views
Centering data using pm.Deterministic	8	66	February 15, 2025
Posterior distribution of estimated parameter has a lot of variation version agnostic modeling , sampling	2	378	March 24, 2023
Estimation of scale parameter of standardized outcome in linear regression Questions	0	487	July 12, 2019
Non-normalized data in a GLM how to handle? v5 modeling	1	428	August 28, 2023
Generic way to "normalize" inference model Questions	7	1499	May 19, 2022

Where to standardize data

Related topics