Question about General Modeling Techniques

Is there any reason not to transform the predicted variable (y) into its log form? I’m trying to develop a time series model of Poisson-distributed data, and it seems to fit better when I transform the y variable to its log form and fit it with a Normal distribution.

But I’m not sure if there are risks doing that…are there risks?

it seems to fit better

How is that being assessed? Have you ruled out that the data may be over- or under-dispersed? That is, the “betterness” could simply be that the variance is modeled by an additional parameter. Gamma-Poisson is a more typical model for such a case, staying within the discrete support while modeling dispersion.

I am assessing the fit by looking at the posterior predictive samples, in addition to the trace plot and summary stats…

[Figure: posterior predictive plots for three models: Normal distribution with log(y), Poisson distribution with untransformed y, and Normal distribution with untransformed y]

Forgive my naivete but how do I check for over/under dispersion? I assume this means the variance but correct me if I’m wrong.

Thank you for the tidbit on gamma-poisson. Is there a tutorial or example to demonstrate this?

That looks overdispersed. And yes, overdispersion means more variance in the data (after conditioning on covariates) than a Poisson can support, since a Poisson’s variance is tied to its mean. The Gamma-Poisson is also known as the Negative Binomial. There is a regression example in this notebook.
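As a quick first look at dispersion, you can compare the sample variance to the sample mean; for Poisson data the ratio should be near 1. This is only a rough sketch: the counts below are made up for illustration, `dispersion_index` is a hypothetical helper name, and the check ignores covariates, so for a regression you would want to look at the residual variation instead.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean (close to 1 for Poisson data)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the dispersion index is undefined")
    return variance(counts) / m


# Hypothetical counts, invented for illustration only
roughly_poisson = [2, 7, 4, 6, 1, 5, 4, 3, 8, 2]
overdispersed = [0, 12, 1, 0, 25, 3, 0, 9, 1, 0]

ratio_a = dispersion_index(roughly_poisson)
ratio_b = dispersion_index(overdispersed)

print(f"roughly Poisson: variance/mean = {ratio_a:.2f}")
print(f"overdispersed:   variance/mean = {ratio_b:.2f}")
```

A ratio well above 1 points toward overdispersion (and a Negative Binomial likelihood); a ratio well below 1 points toward underdispersion.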

Transformation of the observation variable is super common, especially in the time series domain (where I think you are working). Basically every analysis starts with scaling or de-trending one way or another (you can’t even start doing classical time series analysis until variables have been transformed to be stationary!). Statistically there’s no danger in doing transformations: modeling one way is as valid as modeling another. Just make sure you keep track of all the transformations you do, and make sure you undo them in the right order once it comes time to check your predictions.
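To make the "undo them in the right order" point concrete, here is a minimal sketch, assuming a two-step pipeline (log, then standardize). The function names `forward` and `inverse` and the data are hypothetical. The inverse applies the steps in reverse: un-standardize first, then exponentiate.

```python
import math
from statistics import mean, stdev


def forward(y):
    """Log-transform, then standardize. Returns the transformed series
    plus the parameters needed to undo the standardization later."""
    logged = [math.log(v) for v in y]        # step 1: log (requires y > 0)
    mu, sd = mean(logged), stdev(logged)
    z = [(v - mu) / sd for v in logged]      # step 2: standardize
    return z, {"mu": mu, "sd": sd}


def inverse(z, params):
    """Undo the transformations in reverse order."""
    logged = [v * params["sd"] + params["mu"] for v in z]  # undo step 2
    return [math.exp(v) for v in logged]                    # undo step 1


# Hypothetical positive observations, invented for illustration only
y = [12.0, 15.0, 9.0, 20.0, 18.0]
z, params = forward(y)
recovered = inverse(z, params)

print("max round-trip error:", max(abs(a - b) for a, b in zip(y, recovered)))
```

Note that a plain log needs strictly positive values, so count data containing zeros would need some other handling before this step.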

To the extent there is danger, it comes from 1) how to interpret your results, and 2) how to communicate those results to stakeholders, who might not have any statistics training.

Problem 1 isn’t so bad, especially if all your transformations are reversible.

For problem 2, you have to be careful and think about what the key points of the analysis are. For example, transforming your sales into logs and using a normal distribution means the coefficients of your model are semi-elasticities (i.e. a 1 unit change in X leads to roughly a 100\beta% change in the predicted mean of y, for small \beta).

If you instead model the rate parameter of a Poisson, the interpretation is actually similar because you end up taking logs anyway, but it’s common to talk about “rates” in this context, i.e. the rate of sales increases by roughly 100\beta percent given a unit increase in X (since you will model \lambda = \exp(f(X)), so \log \mathbb{E}[y_t] = \log \lambda = f(X)).
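For what it’s worth, the "percent change" reading of a coefficient on the log scale is an approximation that only holds for small \beta; the exact multiplicative effect of a one-unit increase in X is \exp(\beta). A small sketch of the difference, where `percent_change_per_unit` is a hypothetical helper name and the coefficient values are arbitrary:

```python
import math


def percent_change_per_unit(beta):
    """Exact percent change in the mean (or rate) for a one-unit increase
    in X, when the model is linear on the log scale."""
    return 100.0 * (math.exp(beta) - 1.0)


# Arbitrary coefficient values, chosen for illustration only
for beta in (0.01, 0.05, 0.5):
    exact = percent_change_per_unit(beta)
    approx = 100.0 * beta
    print(f"beta={beta}: exact {exact:.2f}%, approximation {approx:.2f}%")
```

When communicating results, reporting the exact figure avoids overstating or understating the effect when coefficients are not small.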