Polynomial regression or logarithm of predictor

I’m using the Howell1 data set from the Statistical Rethinking book. The data has 544 values of people’s heights and weights.

I built a bivariate linear regression with the weight as the predictor of height, and I used the natural logarithm of the predictor:

import numpy as np
import pymc3 as pm

with pm.Model() as m_4:
    α = pm.Normal('α', 178, 100)   # prior on the intercept
    β = pm.Normal('β', 0, 100)     # prior on the slope
    σ = pm.Uniform('σ', 0, 50)     # prior on the residual scale
    μ = pm.Deterministic('μ', α + β * np.log(df_all.weight))

    heights = pm.Normal('heights', mu=μ, sd=σ, observed=df_all.height)
    trace_m_4 = pm.sample(5000, tune=3000, njobs=1)

Resulting in:

Or the same data, just visualized differently when I use the regular weights on the x axis:

This is visually very appealing. However, another way to reach a similar curve in the regression line was through the polynomial regression (second-order polynomial):

d['weight_std2'] = d.weight_std ** 2  # note: attribute assignment (d.weight_std2 = ...) would not create a new column

with pm.Model() as m_4_5:
    alpha = pm.Normal('alpha', mu=178, sd=100)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    sigma = pm.Uniform('sigma', lower=0, upper=50)
    mu = pm.Deterministic('mu', alpha + beta[0] * d.weight_std + beta[1] * d.weight_std2)
    height = pm.Normal('height', mu=mu, sd=sigma, observed=d.height)
    trace_4_5 = pm.sample(1000, tune=1000)

Resulting in:

Is there a rule of thumb about when to use a polynomial and when to use the logarithm of the predictor? (Both seem likely to overfit the data when the sample size is small.)

Ideally you have some theory to support using one model over the other, otherwise, you can perform model comparison using what is available in pm.compare.
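For intuition on what such a comparison reports: the WAIC that pm.compare can use is computed from the pointwise posterior log-likelihoods. Here is a minimal numpy/scipy sketch of the idea, using a toy normal model in place of a real trace (the log-likelihood array and posterior draws here are simulated stand-ins, not output from the models above):

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """WAIC from an (n_draws, n_obs) array of pointwise log-likelihoods.

    lppd: log pointwise predictive density (likelihood averaged over draws).
    p_waic: effective number of parameters, the per-observation variance
    of the log-likelihood across posterior draws.
    """
    n_draws = log_lik.shape[0]
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(n_draws))
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2 * (lppd - p_waic)  # deviance scale: lower is better

# toy check: log-likelihoods of N(0, 1) data under pretend posterior draws of mu
rng = np.random.default_rng(1)
y = rng.normal(0, 1, size=50)
mu_draws = rng.normal(0, 0.1, size=200)
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu_draws[:, None]) ** 2
print(waic(log_lik))
```

A lower WAIC for one of your two models would favor it on estimated out-of-sample predictive accuracy, which is exactly the kind of evidence to use when no theory settles the choice.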

Note that for biological measurements of body size, the relationships are often exponential (see Allometry - Wikipedia). If you asked me for a reason, I would say (without checking the literature) that organisms are usually arrangements of cells that grow in three dimensions, so growth measured along a single dimension (height) misses the other two.

From one of my rare encounters with frequentist statistics, I remember that there are statistical tests that assess goodness of fit, something with a \chi^2 statistic. I do not remember whether it was this or this test. I’m not adept enough to tell whether that is implicitly done in WAIC and LOO, but I think the approaches are not exactly the same. @junpenglao, do you know if something like a goodness-of-fit test can be achieved in a Bayesian framework?
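One Bayesian analogue of a goodness-of-fit test is a posterior predictive check with a Bayesian p-value: simulate replicated data sets from the posterior, compute a discrepancy statistic on each replicate, and see where the observed statistic falls. A numpy sketch with simulated posterior draws standing in for a real trace (the data, the statistic, and the draws here are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# observed data and (pretend) posterior draws of mu and sigma
y = rng.normal(170, 8, size=100)
mu_draws = rng.normal(y.mean(), 0.5, size=500)
sigma_draws = np.abs(rng.normal(y.std(), 0.3, size=500))

# discrepancy statistic: here the sample standard deviation
T_obs = y.std()

# replicate one data set per posterior draw and compute the same statistic
T_rep = np.array([
    rng.normal(m, s, size=y.size).std()
    for m, s in zip(mu_draws, sigma_draws)
])

# Bayesian p-value: fraction of replicates at least as extreme as the data.
# Values near 0 or 1 flag misfit; values near 0.5 show no evidence of misfit.
p_value = np.mean(T_rep >= T_obs)
print(p_value)
```

With a real model you would draw the replicates with pm.sample_ppc from the fitted trace instead of simulating them by hand.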

Anyway, another thing you would want to look at is the distribution of the residuals, which you can obtain by subtracting your posterior mean prediction from the raw data.
To me, this sounds like a very Bayesian thing to do: as a rule of thumb, the residuals should be symmetrically distributed around the zero line, with no systematic deviation over the range of the predictor. Your second-order polynomial seems to overestimate the heights of the lightest subjects, whereas it underestimates the heights of medium-weight subjects (arrows):
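A minimal sketch of that residual check. With the models above you would use d.height.values and the posterior samples of 'mu' from the trace (shape n_draws × n_obs, since 'mu' was stored as a Deterministic); the arrays here are synthetic stand-ins so the snippet runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins: in practice, height = d.height.values and
# mu_samples = trace_4_5['mu'] (shape: n_draws x n_obs)
height = rng.normal(150, 10, size=544)
mu_samples = height[None, :] + rng.normal(0, 5, size=(2000, 544))

# posterior mean prediction per observation
mu_mean = mu_samples.mean(axis=0)

# residual = observed value minus posterior mean prediction
residuals = height - mu_mean

# quick numeric summary before plotting residuals against the predictor:
# roughly symmetric around zero, with no trend over weight
print(residuals.mean(), np.median(residuals))
```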

Note also that the fit might be biased by unequal sampling: since a majority of measurements appear to cluster around +1 sd weight, these points will drag the fit towards them. A solution might be a weighted regression, which some software packages offer.
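For illustration, np.polyfit accepts per-point weights, so the effect of down-weighting an oversampled region can be sketched directly. The data and the crowding-based weighting scheme below are assumptions for the sketch, not a recommendation of a specific method:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic linear data with heavy oversampling around x = 1
x = np.concatenate([rng.uniform(-2, 2, 50), rng.normal(1, 0.1, 500)])
y = 2.0 * x + 3.0 + rng.normal(0, 0.5, x.size)

# down-weight crowded regions: weight each point by the inverse of the
# number of observations in its histogram bin
counts, edges = np.histogram(x, bins=20)
bin_idx = np.clip(np.digitize(x, edges) - 1, 0, counts.size - 1)
w = 1.0 / counts[bin_idx]

# ordinary vs weighted linear fit (np.polyfit's w scales the residuals)
slope_ols, intercept_ols = np.polyfit(x, y, 1)
slope_wls, intercept_wls = np.polyfit(x, y, 1, w=w)
print(slope_ols, slope_wls)
```

On clean linear data both fits recover the same slope; when the crowded cluster sits off the overall trend, the weighted fit is the one that resists being dragged towards it.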

Finally, I found these articles by Norm MacLeod on regression to be a very revealing source:
PalaeoMaths 101 | The Palaeontological Association (part 1-4)
PalaeoMath: Part 3 - Regression III | The Palaeontological Association

Best,

Falk


I usually check with LOO first. I always love statistics that come with their own diagnostics (i.e., they tell you when the computation failed and is not to be trusted).

Thank you both. You have given me a lot of things to read up on. I appreciate your help.