Where to standardize data

If I want to standardize data, i.e. (values - values.mean())/values.std(), where should I do that, inside or outside the model?

I would love to be able to predict on new values and plot them using az.plot_posterior() and having the non-standardized estimations shown. Is the only way to use pm.Deterministic do calculate the non-standardized parameters and the non-standardized estimated value? If so, how do I plot that estimated value in az.plot_posterior()?

There are many alternatives, each with different strengths and focuses, personally I am not tied to any and I am not really sure what to recommend either so here goes a quick overview of the alternatives:

  • using pm.Deterministics
  • using only standarized values in the model and after inference/forward sampling destandarize the results of the inferencedata object.
    Here you can create new variables on the dataset or directly modify the present variables, InferenceData objects are completely mutable. Moreover there are some helper functions such as idata.map to apply the same function to multiple groups such as observed_data, posterior_predictive and predictions at the same time…
  • like the above but using the transform argument of ArviZ plots.
  • use only destandardized data in the model and standardize within the model code (either storing the destandardized versions with deterministic or not, it’s probably not necessary in most cases). I guess that in most cases the standardization cost is negligible compared to the rest of the model

And focused on predictions/posterior_predictive sampling, there is also the option to create a custom sample posterior predictive with xarray/numpy/dask… This option is extremely flexible but generally requires more work than the other alternatives, so I would only recommend it if there is an extra need for this such as a bug in PyMC sample posterior predictive for a required distribution or having to predict a large amout of data using dask because it does not fit in memory…

2 Likes

To spin off on standardization.
What throws me off is how to scale the prior for the error.
My unstandardized priors are

intercept = pm.Normal(100, 10)
slope = pm.Normal(-100, 10)
error = pm.HalfNormal(5)

My initial thought was to scale the error like this:

5 / y_obs.std()

But that result in estimations that are a bit too narrow. And 1 is waaaay to wide.