[quick conceptual question] shouldn't the lognormal distribution be used as the likelihood more often?

I have been reading over the German tank problem as an example of a serious statistical question that can be answered nicely with Bayesian methods (i.e., PyMC3), and I was struck by how intuitive it was to constrain the DiscreteUniform likelihood to be at least 1, because we know the Germans made at least one tank during their war effort. Then the thought struck me: why shouldn't the same kind of assumption be baked into most regressions that use the Normal distribution?

For example, we sometimes want to estimate the price of something, which can never be negative, and many other practical regression problems share this same restriction on the dependent variable. So shouldn't the log-normal distribution, as a quick alternative, be used in most cases of regression?
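To make the restriction concrete (a sketch of my own, with made-up price numbers): a Normal likelihood always assigns some probability to negative prices, while a log-normal assigns none, since its support is the positive reals.

```python
from scipy import stats

# Made-up price model: typical price around $20, spread around $15.
normal_like = stats.norm(loc=20, scale=15)
lognormal_like = stats.lognorm(s=0.6, scale=20)  # scale = exp(mu)

p_neg_normal = normal_like.cdf(0)        # strictly positive: Normal leaks below 0
p_neg_lognormal = lognormal_like.cdf(0)  # exactly 0: support is (0, inf)
```
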

I apologize if this question is incredibly naive or betrays some huge statistical misunderstanding; at the very least I want to find out whether my reasoning checks out, and I would appreciate any comments on it.

Are you thinking just about bounds? If so, would a pm.HalfNormal (which is an efficient form of pm.Bound(pm.Normal, lower=0.0)) satisfy your argument just as well?

Typically, I think some sort of generative argument (a Student's t-distribution arising as a scale mixture of Gaussians, or the Gumbel as the maximum of many exponentials) would hold more sway, and the bounds would be imposed after selecting the distribution.

While I do not dispute the value of generative arguments for choosing a distribution, isn't there also value in specifying in a model that you know the dependent/output variable is never negative?

Maybe I should have asked, in general, what is the correct way to specify that the dependent variable lies on the positive real number line?


Oof, sorry! Should have opened with: yes! lots of value to specifying that, and it is a great idea to do!

pm.Bound(...) will actually return an honest distribution that can have observations, so if you are modelling how much of your $100 you spend at a park with a $5 entrance fee,

```python
import pymc3 as pm

with pm.Model():
    # Normal restricted to the interval [5, 100]
    money_dist = pm.Bound(pm.Normal, lower=5.0, upper=100.0)
    money_spent = money_dist('money_spent', mu=50, sd=30)
```

might be a good choice.

Huh, I was always under the impression (from random posts around these forums) that the Bounded distributions were only temporary measures, not to be trusted as full-fledged statistical distributions. It is certainly news to me if that is false.

On that note, however: if one were to choose a positive-only distribution for the predicted variable, would you recommend anything for figuring out which to pick? I mentioned the log-normal distribution as a natural choice because of its simple shape, but pymc3's distributions page also lists the Gamma, Weibull, and ChiSquared, which fulfill the same requirement.
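For concreteness, all the candidates I listed share the property in question, so shape rather than support would have to decide between them. A quick scipy check, with arbitrary parameter values of my own choosing:

```python
from scipy import stats

# Arbitrary illustrative parameters; only the support matters for this question.
candidates = {
    "LogNormal": stats.lognorm(s=0.5, scale=1.0),
    "Gamma": stats.gamma(a=2.0),
    "Weibull": stats.weibull_min(c=1.5),
    "ChiSquared": stats.chi2(df=3),
}

# Every support is (0, inf), and every density vanishes below zero.
supports = {name: dist.support() for name, dist in candidates.items()}
densities_below_zero = {name: dist.pdf(-1.0) for name, dist in candidates.items()}
```
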

Those two reasons you mentioned for the Student's t and the Gumbel are so succinct. I had never heard of them before, despite having thought I was familiar with those distributions ^^;

The main difference is that pm.Bound does not normalize the density function, which means the resulting pdf no longer integrates to 1. This is not an issue if you are using a fixed constant as the bound, since the sampler can work with an unnormalized density without problems. However, there can be problems if you are using another random variable as the bound; for those use cases a truncated model is more appropriate.
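To see the normalization point concretely, here is a sketch in scipy (numbers are my own illustration): the Normal pdf restricted to a fixed lower bound integrates to the tail probability, a known constant, while a properly truncated distribution divides by that constant so its density integrates to 1 again.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sd, lower = 0.0, 10.0, 5.0  # illustrative values, not from the thread

# Unnormalized density a la pm.Bound: the Normal pdf restricted to x > lower.
def unnorm_pdf(x):
    return stats.norm(mu, sd).pdf(x)

# Its integral over the support is the tail mass P(X > lower), which is < 1:
area, _ = quad(unnorm_pdf, lower, np.inf)
tail_mass = stats.norm(mu, sd).sf(lower)

# A truncated Normal divides by that constant, so it integrates to 1:
trunc = stats.truncnorm((lower - mu) / sd, np.inf, loc=mu, scale=sd)
area_trunc, _ = quad(trunc.pdf, lower, np.inf)
```

Because the bound is a fixed constant, the missing normalization is itself a constant, which is exactly why the sampler is unaffected; with a random-variable bound it would vary with the parameters.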
