Generic way to "normalize" inference model

Hi,
I have previously struggled with inference on tiny observed values (Inference on tiny values), and I have now come to a point where the “hacks” of generating a centered distribution no longer seem feasible.

Is there a proposed way to normalize observed data and accompanying priors to allow sampling in the range 0-100 where the sampler “works”?

I’d say the two most common scaling techniques are standard scaling (“z-score” scaling), and min-max scaling. Both are implemented in sklearn.preprocessing to make transformation and inverse transformation easy. Calling the observed data y_i and transformed data \tilde y_i, they are:

Standard:
\tilde y_i = \frac{y_i - \mu_y}{\sigma_y}

Min-max:
\tilde y_i = \frac{y_i - \min(y)}{\max(y) - \min(y)}

Using the standard scaler, your data will have a mean of 0 and a standard deviation of 1. This makes choosing priors extremely easy: choose a mean of 0 and a standard deviation of 1. Each point becomes a measure of its “extremeness”, the number of sample standard deviations from the sample mean.

The min-max scaler will transform your data to be between 0 and 1. It preserves relative distances between data points. If you wanted to insist on a certain range, for example 0-100, you could multiply the scaled values by 100.
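As a rough sketch of the two formulas above written out by hand in NumPy (the array `y` is made-up data, not from this thread): it applies each transformation, stretches the min-max result to 0-100, and undoes both. As far as I know, `StandardScaler` also divides by the population standard deviation, which is what `np.std` computes by default, and `MinMaxScaler` takes a `feature_range` argument if you would rather not multiply by 100 yourself.

```python
import numpy as np

# Hypothetical tiny observations, standing in for the real data
y = np.array([2e-9, 5e-9, 1e-8, 3e-8, 7e-9])

# Standard (z-score) scaling and its inverse
mu_y, sigma_y = y.mean(), y.std()
y_std = (y - mu_y) / sigma_y
y_back_std = y_std * sigma_y + mu_y

# Min-max scaling, stretched to the range 0-100, and its inverse
y_min, y_max = y.min(), y.max()
y_mm = 100 * (y - y_min) / (y_max - y_min)
y_back_mm = y_mm / 100 * (y_max - y_min) + y_min

print("standard:", y_std.mean(), y_std.std())
print("min-max: ", y_mm.min(), y_mm.max())
```

Both `y_back_std` and `y_back_mm` should match `y` up to floating point rounding.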

Usage of the sklearn objects is something like this:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# I assume y is 1d; the scaler expects a 2d column vector
y_tilde = scaler.fit_transform(y[:, None]).ravel()

# Do modeling, obtain posterior estimates y_hat_tilde

# inverse_transform also expects a 2d array, so reshape first and flatten after
y_hat = scaler.inverse_transform(np.asarray(y_hat_tilde).reshape(-1, 1)).ravel()

You can then do inference as normal on y_hat, which will be in the same units as the original data.


Thank you so much for getting back to me!
Is this something that would work for all kinds of distributions?

You should be careful to distinguish between your data (the things you observe) and the distribution, which is a fake thing you use to describe how the data might have been created.

The scalers operate on any data, no sweat. But the results might change what distributions you could choose to model the data. One example might be count data, so you have y \in \mathbb Z^+, which leads you to choose a Poisson or Negative Binomial distribution as your model, since those spit out non-negative integers. If you standard scale it, you will end up with \tilde y \in \mathbb R, because after subtracting the mean roughly half of your observations will be negative, and after dividing by the standard deviation essentially none will be integers. So these distributions are no longer appropriate; something like a Normal would make more sense.
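A small illustration of the count-data case, assuming simulated Poisson draws (the rate and sample size are arbitrary choices of mine): after standard scaling, it counts how many values are negative and how many are still whole numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count data: non-negative integers
counts = rng.poisson(lam=4.0, size=1000)

# Standard scaling by hand
scaled = (counts - counts.mean()) / counts.std()

# How many scaled values are negative, and how many are still whole numbers?
n_negative = int((scaled < 0).sum())
n_integer = int(np.isclose(scaled, np.round(scaled)).sum())

print("negative values:", n_negative, "of", scaled.size)
print("integer values: ", n_integer, "of", scaled.size)
```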

On the other hand, if you were to MinMax scale, a Normal might not be right either because your observations are constrained to 0-1. In that case you reach for something else.

Thank you once again for the time to get back to me.

I clearly see how standard scaling may result in count data no longer being purely on the positive axis. I am however having a hard time accepting that such a scaling also changes the distribution one would use to fit the data. An example:

Let's say I have data I believe is exponentially distributed. In order to help the sampler, I do a standard scaling. My data now lives on both the negative and the positive line. I am no longer able to choose an exponential distribution to fit my data, but they are exponentially distributed!? Would choosing anything else not decrease my model fit?

If I think my data is normally distributed, I guess the scaled model can still be fit with the “right” distribution. However, if I have prior beliefs I want to incorporate in a typical pymc3 model (as a prior), how do I ensure that the priors get scaled accordingly? I would have to scale the prior as well, but keep the “relative error” in order for the inference to be accurate?

Something that helped me a lot was to think about abstract spaces, the things that live there, and ways things move from one abstract space to another. Functions are maps between abstract spaces. When you apply a function, you leave one space and go to another, where the rules might be totally different.

Everything you say in your post is correct in the data space. That’s where the process you are interested in studying lives. It makes the most sense. It’s nice! But when you apply a scaler, you leave the data space. Up is down! Cats lie with dogs! The rules of the data space don’t apply anymore, because you’re not in the data space anymore.

Your data are still exponentially distributed in the data space. But you aren’t there anymore, and they aren’t exponentially distributed in the scaled space. In the scaled space they follow a shifted, rescaled version of that distribution, which a plain exponential cannot describe, so in practice you reach for something supported on the whole real line: Normal, Student-t, Cauchy, or the like. If you dogmatically try to model with data space rules in the scaled space, you will get a bad fit. Think about it: an exponential distribution assigns zero probability to negative numbers. Sticking to this means confidently declaring that a large part of the data, which you currently hold in your scaled hands, is totally impossible!
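To make the "totally impossible" point concrete, here is a sketch with simulated exponential data (the scale and the hand-written `exponential_logpdf` helper are mine, for illustration only). It compares the exponential log-likelihood in the data space with the one you would get after standard scaling.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical exponentially distributed data, then standard scaled by hand
y = rng.exponential(scale=2.0, size=1000)
y_tilde = (y - y.mean()) / y.std()

def exponential_logpdf(x, rate=1.0):
    """Log-density of an Exponential(rate); -inf outside the support x >= 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, np.log(rate) - rate * x, -np.inf)

# Share of scaled observations that fall below zero
share_negative = float((y_tilde < 0).mean())

# Total log-likelihood under an exponential model, before and after scaling
loglik_data_space = float(exponential_logpdf(y, rate=0.5).sum())
loglik_scaled_space = float(exponential_logpdf(y_tilde, rate=1.0).sum())

print("share of negative scaled values:", share_negative)
print("log-likelihood in data space:   ", loglik_data_space)
print("log-likelihood in scaled space: ", loglik_scaled_space)
```

A single negative observation is enough to push the scaled-space log-likelihood to minus infinity, whatever rate you pick.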

If you plan to apply a scaler to your data, you need to do all your EDA in the scaled space. This will help you build intuition about the kinds of priors you need to use, and the type of likelihood function you will eventually choose for your model. You might also try scaling the priors you already had in mind, and see where they end up. It’s a nice idea.
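One way to try scaling a prior, sketched under assumptions: the data are standard scaled, and the data-space prior is a Normal(m0, s0) on the mean (the data, `m0` and `s0` are invented numbers). Because the scaling is an affine map, a Normal prior stays Normal and only its parameters move; pushing prior draws through the same transformation is a quick sanity check.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical tiny-valued data and its scaling constants
y = rng.normal(loc=3e-6, scale=5e-7, size=200)
mu_y, sigma_y = y.mean(), y.std()

# Hypothetical data-space prior on the mean: Normal(m0, s0)
m0, s0 = 2.5e-6, 1e-6

# The same prior expressed in the scaled space
m0_scaled = (m0 - mu_y) / sigma_y
s0_scaled = s0 / sigma_y

# Check: push data-space prior draws through the same transformation
draws = rng.normal(m0, s0, size=100_000)
draws_scaled = (draws - mu_y) / sigma_y

print("analytic: ", m0_scaled, s0_scaled)
print("empirical:", draws_scaled.mean(), draws_scaled.std())
```

A prior on a standard deviation parameter would only be divided by `sigma_y`, with no shift, since subtracting a constant does not change spread.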

One final note: the two scalers I proposed (as well as all other widely used scaling functions) are bijective. This means you bought round-trip tickets when you left data space for scaled space. All the scaling can be undone. When you do your final analysis, you can return to data space and check that everything makes sense. This is how you will ensure that your inference is accurate.

Hi, thank you for being so patient with me and giving great explanations and examples!
So if I want to keep the distribution type, but also keep my data at “sensible magnitudes” to help the sampler, typically on the scale of -100 to 100, could I do something like scaling the observed values by the mean of the observations, and then scale the intuitive priors accordingly?

I am trying to find a decent way to get around changing the distribution type used to fit my data if I believe it is, e.g., exponentially distributed. I think I will lose a lot of intuition by doing this…
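A hedged sketch of what dividing by the mean could look like, assuming simulated exponential data (the scale is an arbitrary tiny number): dividing by a positive constant keeps everything on the positive axis, so an exponential with scale β simply becomes an exponential with scale β/c. Nothing here is specific to pymc3.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical tiny-valued, exponentially distributed data
true_scale = 4e-7
y = rng.exponential(scale=true_scale, size=5000)

# Divide by a positive constant (here the sample mean); no shift is applied
c = y.mean()
y_scaled = y / c

# The maximum-likelihood estimate of an exponential scale is the sample mean
scale_hat_scaled = y_scaled.mean()

# Multiply by the same constant to get back to data units
scale_hat = scale_hat_scaled * c

print("smallest scaled value:", y_scaled.min())
print("scale in scaled space:", scale_hat_scaled)
print("scale in data units:  ", scale_hat)
```

A data-space prior on the scale parameter would be divided by the same constant `c`, and posterior draws multiplied by `c` to get back.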

Hi, I have a follow up question.
The first post in this thread referred to inference on tiny values, failing with a “division by zero” error.
I have only seen the division by zero issue on distributions that cannot be auto-transformed to log-space (Probability Distributions in PyMC3 — PyMC3 3.11.5 documentation), e.g. Normal or Student-t distributions.

Am I correct to assume that this error will not occur for distributions on the positive axis, and that any scaling will be crucial only for the distributions also on the negative axis?

Thanks again