If I want to standardize data, i.e. (values - values.mean()) / values.std(), where should I do that: inside or outside the model?
I would love to be able to predict on new values and plot them using az.plot_posterior(), with the non-standardized estimates shown. Is the only way to use pm.Deterministic to calculate the non-standardized parameters and the non-standardized estimated value? If so, how do I plot that estimated value in az.plot_posterior()?
There are many alternatives, each with different strengths and focuses. Personally I am not tied to any of them and I am not really sure what to recommend either, so here goes a quick overview:
- using pm.Deterministic variables
- using only standardized values in the model and, after inference/forward sampling, destandardizing the results in the InferenceData object.
Here you can create new variables on the dataset or directly modify the existing ones; InferenceData objects are completely mutable. Moreover, there are helper functions such as idata.map to apply the same function to multiple groups (observed_data, posterior_predictive and predictions, for example) at the same time.
- like the above but using the transform argument of ArviZ plots.
- using only destandardized (raw) data as input and standardizing within the model code (optionally storing the destandardized versions with pm.Deterministic, though that is probably not necessary in most cases). I guess that in most cases the standardization cost is negligible compared to the rest of the model.
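Whichever alternative you pick, the destandardization arithmetic is the same. Here is a minimal numpy sketch for a simple linear regression, assuming both x and y were standardized before fitting. The data, the draws and the destandardize helper are all made up for illustration; in practice the draws would come from the posterior group of your InferenceData.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw data, only used to get means and standard deviations.
x = rng.normal(50.0, 10.0, size=200)
y = 100.0 - 2.0 * x + rng.normal(0.0, 5.0, size=200)
x_mean, x_sd = x.mean(), x.std()
y_mean, y_sd = y.mean(), y.std()

# Stand-ins for posterior draws from a model fitted on standardized data.
a_std = rng.normal(0.0, 0.02, size=1000)                # intercept
b_std = rng.normal(-0.97, 0.02, size=1000)              # slope
sigma_std = np.abs(rng.normal(0.24, 0.01, size=1000))   # error sd


def destandardize(a_std, b_std, sigma_std, x_mean, x_sd, y_mean, y_sd):
    """Map regression parameters from the standardized to the raw scale."""
    slope = b_std * y_sd / x_sd
    intercept = y_mean + y_sd * a_std - slope * x_mean
    sigma = sigma_std * y_sd
    return intercept, slope, sigma


intercept, slope, sigma = destandardize(
    a_std, b_std, sigma_std, x_mean, x_sd, y_mean, y_sd
)

# Sanity check: the expected value at a new x agrees on both routes.
x_new = 42.0
mu_raw = intercept + slope * x_new
mu_via_std = y_mean + y_sd * (a_std + b_std * (x_new - x_mean) / x_sd)
```

The same three expressions are what you would wrap in pm.Deterministic inside the model, or apply to the posterior after sampling.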
Focusing on predictions/posterior predictive sampling, there is also the option to write a custom posterior predictive sampler with xarray/numpy/dask. This option is extremely flexible but generally requires more work than the other alternatives, so I would only recommend it if there is an extra need for it, such as a bug in PyMC's sample_posterior_predictive for a required distribution, or having to predict a large amount of data using dask because it does not fit in memory.
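To give an idea of what a hand-written posterior predictive can look like, here is a rough numpy sketch for a linear model with normal errors. The draws are random stand-ins (assumed to be already on the raw scale), and x_new is a hypothetical set of new predictor values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 2000

# Stand-ins for posterior draws on the raw (destandardized) scale.
intercept = rng.normal(100.0, 1.0, size=n_draws)
slope = rng.normal(-2.0, 0.05, size=n_draws)
sigma = np.abs(rng.normal(5.0, 0.2, size=n_draws))

# Hypothetical new predictor values to predict on.
x_new = np.array([10.0, 20.0, 30.0])

# Broadcast draws against new points: result has shape (n_draws, n_new).
mu = intercept[:, None] + slope[:, None] * x_new[None, :]

# One simulated observation per posterior draw and per new point.
y_pred = rng.normal(mu, sigma[:, None])

# Summaries per new point.
pred_mean = y_pred.mean(axis=0)
pred_interval = np.percentile(y_pred, [3, 97], axis=0)
```

Swapping numpy for xarray or dask arrays keeps the same broadcasting idea while adding labeled dimensions or out-of-memory computation.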
To spin off on standardization.
What throws me off is how to scale the prior for the error.
My unstandardized priors are
intercept = pm.Normal("intercept", mu=100, sigma=10)
slope = pm.Normal("slope", mu=-100, sigma=10)
error = pm.HalfNormal("error", sigma=5)
My initial thought was to scale the error like this:
5 / y_obs.std()
But that results in estimates that are a bit too narrow, and 1 is way too wide.
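For reference, this is how the three priors would translate under a pure change of units, assuming both x and y are standardized and the raw-scale priors are independent normals. It is only a sketch of the arithmetic: rescale_priors and the summary statistics passed to it are made up for illustration, and it does not by itself explain the narrow estimates.

```python
import math


def rescale_priors(x_mean, x_sd, y_mean, y_sd,
                   intercept=(100.0, 10.0), slope=(-100.0, 10.0),
                   error_sd=5.0):
    """Translate raw-scale prior (mean, sd) pairs to the standardized scale."""
    int_mu, int_sd = intercept
    slope_mu, slope_sd = slope

    # slope_std = slope * x_sd / y_sd
    factor = x_sd / y_sd
    slope_std = (slope_mu * factor, slope_sd * factor)

    # intercept_std = (intercept + slope * x_mean - y_mean) / y_sd,
    # so its sd combines both raw priors (assumed independent).
    int_std_mu = (int_mu + slope_mu * x_mean - y_mean) / y_sd
    int_std_sd = math.sqrt(int_sd**2 + (slope_sd * x_mean) ** 2) / y_sd

    # The error sd is only divided by the sd of y.
    error_std = error_sd / y_sd

    return {
        "intercept": (int_std_mu, int_std_sd),
        "slope": slope_std,
        "error": error_std,
    }


# Hypothetical summary statistics of the observed data.
priors = rescale_priors(x_mean=0.5, x_sd=0.2, y_mean=50.0, y_sd=21.0)
print(priors)
```

Note that the standardized intercept depends on the raw slope, so translating the priors one by one like this drops the correlation between them.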