Parameter interpretation with scaled data


I am trying to build a simple linear model and need to make an inference based on the fitted parameters.

First off, my y data is very right-skewed, so I need to transform y.

So, I am trying to apply the Box-Cox transformation inside a pre-defined function such as:

def fn(a, b, c, x, lam):
    y_hat = a + b*x + x**c
    y_hat = (y_hat**lam - 1) / lam  # Box-Cox transform of the predicted mean
    return y_hat

With this function, I am making a model like:

with pm.Model() as model:
    y_obs = boxcox(y, lam)  # use scipy.stats.boxcox to transform the observed y values

    a = pm.HalfNormal('a', sigma=100)
    b = pm.HalfNormal('b', sigma=100)
    c = pm.HalfNormal('c', sigma=100)

    sigma = pm.Exponential("sigma", 10)
    bc_mu = fn(a, b, c, x, lam)

    pm.Normal("obs", mu=bc_mu, sigma=sigma, observed=y_obs)

My questions are:

  1. Would fn(a, b, c, x, lam) really work? Since I did not use any pm.math functions inside fn, I am worried that (y_hat**lam - 1)/lam would not behave properly. Do I need to use functions from pm.math for this to work correctly?

  2. Is this a normal way to build a model with skewed y data? Any other recommendations?

  3. If my x data contains negative values, so it cannot be transformed with Box-Cox, do you recommend shifting/scaling x to positive values?

  4. I need to interpret the parameters a, b, and c on the real-space (non-scaled) x data. But if I scale x, the fitted a, b, and c parameters will also be on the scaled scale under the current model setup. Given that, how can I build the model for better inference?

  1. This should be fine; all basic Python math operations work on random variables without problems.

  2. Box-Cox transformations are a common way to “normalize” right-skewed data, yes. Another way would be to model the skew explicitly, with something like a SkewNormal distribution.

  3. If you have negative data you can use a modified Box-Cox, defined as $(\operatorname{sign}(y_t)\,|y_t|^\lambda - 1)/\lambda$
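That modified (signed) Box-Cox can be sketched in plain NumPy like this (the function name is my own, and it assumes λ ≠ 0):

```python
import numpy as np

def signed_boxcox(y, lam):
    """Modified Box-Cox that tolerates negative y (assumes lam != 0)."""
    return (np.sign(y) * np.abs(y) ** lam - 1.0) / lam

y = np.array([-2.0, -0.5, 0.5, 2.0])
print(signed_boxcox(y, lam=1.0))  # with lam = 1 this reduces to y - 1
```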

  4. The trade-off between transformation and interpretability is common. You can sometimes recover parameter values for non-scaled data by doing some algebra. No matter what you do, things aren’t going to be as nice as “regular” OLS with normally distributed errors once you leave those green pastures.
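The “some algebra” in point 4 can be sketched for the simplest case, a plain a + b*x model with standardized x (my own illustrative example; the x**c term in your model does not rescale this cleanly). If x_s = (x − m)/s, then y = a_s + b_s·x_s implies a = a_s − b_s·m/s and b = b_s/s:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=200)  # true a = 2, b = 3

# Fit on standardized x
m, s = x.mean(), x.std()
x_s = (x - m) / s
b_s, a_s = np.polyfit(x_s, y, 1)  # slope, intercept on the scaled scale

# y = a_s + b_s*(x - m)/s  =>  a = a_s - b_s*m/s,  b = b_s/s
a = a_s - b_s * m / s
b = b_s / s
print(a, b)  # close to the true values 2.0 and 3.0
```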

Thank you for the detailed explanation.
What is the difference between using pm.math and basic python math then?

For example:

def fn(a,b,c,x):
    y_hat = a + b*x + x**c
    y_hat = np.exp(y_hat)
    return y_hat

y_hat = fn(a,b,c,x)


def fn(a,b,c,x):
    y_hat = a + b*x + x**c
    return y_hat

y_hat = fn(a,b,c,x)
y_hat = pm.math.exp(y_hat)

Would these two versions result in any differences?

One of the devs can correct me if I’m wrong, but under the hood PyMC uses a package called Aesara to handle all the math. Aesara has its own versions of all the usual math functions. pm.math points to those functions, but Aesara is smart enough to swap out most “routine” numpy functions like np.exp, np.log, and whatnot with the Aesara versions.

Despite this, using pm.math is best practice because you’ll never be surprised by odd corner cases.

Thank you for the clarification!