Why does this Bayesian regression analysis fail?

I tried to take inspiration from the example given [HERE: Linear regression] to analyze a regression with a small set of personal data.

The regression line resulting from this analysis does not correspond at all to what one might expect…

I wonder what is wrong with my program, which does not look very different from the one given in the example. Were some mandatory details implied (but not explicitly written) in the example?

Here is what I used:

import arviz as az
import matplotlib.pyplot as plt
import pandas as pd
import pymc as pm
import xarray as xr

df = pd.DataFrame.from_dict({"Date": [1145, 1151, 1152, 1176, 1195, 1220, 1225],
                             "Height": [22.0, 24.0, 25.0, 31.0, 34.0, 42.5, 47.0]}
                           )
with pm.Model() as mdl:

    intercept = pm.Normal("intercept", mu = 0, sigma = 20)
    slope = pm.Normal("slope", mu = 0, sigma = 20)
    sigma = pm.HalfCauchy("sigma", beta = 10)

    likelihood = pm.Normal("y", mu = intercept + slope * df.Date, sigma = sigma, observed = df.Height)
    
    idata = pm.sample(target_accept=.95)
idata.posterior["y_model"] = idata.posterior["intercept"] + idata.posterior["slope"] * xr.DataArray(df.Date)

fig, ax = plt.subplots(figsize = (6, 4))

az.plot_lm(idata = idata, y = "y", num_samples = 200, axes = ax, y_model = "y_model")

ax.set_title("Posterior predictive regression lines")
ax.set_xlabel("x")
plt.show()

The strange resulting plot is:

[image: regression]

I think this is a scale problem. Your Date variable takes values that are quite large, in the thousands. If you imagine drawing a line back from your data points (heights around 30, at dates around 1200) to x = 0, it has a long way to go, so the intercept is likely to be a large negative number. Computing the OLS coefficients is instructive:

import numpy as np

# OLS via the normal equations: solve (X'X) b = X'y
X = np.c_[np.ones(7), df.Date.values]
np.linalg.solve(X.T @ X, X.T @ df.Height)

Out: array([-302.9709337,    0.28391778])

So the intercept is around -300, which under your Normal(0, 20) prior has probability essentially zero: it sits roughly 15 prior standard deviations below the mean. Since you're ruling out intercepts that would allow slopes consistent with your data, the model compromises by abandoning the slope and increasing the uncertainty around a flat intercept.
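As a quick sanity check, the prior tail mass below the OLS intercept can be computed directly. This is a sketch using scipy, which is not part of the original thread:

```python
from scipy.stats import norm

# Probability that intercept <= -300 under the Normal(mu=0, sigma=20) prior.
# -300 is 15 prior standard deviations below the mean, so this is tiny.
p = norm.cdf(-300, loc=0, scale=20)
print(p)
```

The number is so small that, in practice, the sampler will never visit intercepts anywhere near the OLS solution.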

You can solve this in one of two ways. First, you could crank up the sigma on the intercept prior so that it allows values like -300, or even switch it to something fat-tailed, like a Cauchy. This works fine, but a more general solution (and the one recommended by pretty much everyone) is to scale your inputs. Instead of fitting height ~ a + b * date, fit height ~ a + b * z_date, where z_date = (df.Date - df.Date.mean()) / df.Date.std(). Normalizing your inputs to zero mean and unit variance makes it much easier to reason about priors, and it often helps the sampler cruise along as well. See this discussion, for example.
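A minimal sketch of the standardization step (the name `z_date` is just illustrative), using plain OLS to show that the coefficients on the standardized scale are small enough for Normal(0, 20) priors to cover comfortably:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": [1145, 1151, 1152, 1176, 1195, 1220, 1225],
                   "Height": [22.0, 24.0, 25.0, 31.0, 34.0, 42.5, 47.0]})

# Standardize the predictor: zero mean, unit standard deviation.
z_date = (df.Date - df.Date.mean()) / df.Date.std()

# OLS on the standardized scale: the intercept equals the mean height
# (~32.2) and the slope is ~9.5, both well inside a Normal(0, 20) prior.
X = np.c_[np.ones(len(df)), z_date.values]
beta = np.linalg.solve(X.T @ X, X.T @ df.Height)
print(beta)
```

The same `z_date` can be dropped into the PyMC model in place of `df.Date`; remember that the fitted slope is then in height units per standard deviation of Date.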


It now works perfectly.
Scaling data is really something to think about.
Thank you for pointing me in the right direction!


By the way, this is where a prior predictive check would have been helpful! Looking at what your priors imply about your modeling assumptions relative to the data you are fitting can reveal problems like this.


Mmmh, would you have simply done this, in order to check? (see image below)

az.plot_ppc(idata, group="prior", figsize=None)
plt.show()

What would you have concluded, and what steps would you have taken? Just spontaneously re-scale the data? Or modify the priors, or something else?

I ask these questions because I am really very interested in how someone who is used to PyMC would have reacted when faced with such data…

When I do prior predictive checks of regression models I sample and plot regression lines. For example:

with pm.Model() as mdl:

    intercept = pm.Normal("intercept", mu = 0, sigma = 20)
    slope = pm.Normal("slope", mu = 0, sigma = 20)
    sigma = pm.HalfCauchy("sigma", beta = 10)

    likelihood = pm.Normal("y", mu = intercept + slope * df.Date, sigma = sigma, observed = df.Height)
    prior = pm.sample_prior_predictive()
    
prior.prior["y_model"] = prior.prior["intercept"] + prior.prior["slope"] * xr.DataArray(df.Date)
fig, ax = plt.subplots()
ax.plot(df.Date, prior.prior['y_model'].values[0].T, color='0.5', alpha=0.25);

Generates the following plot:

[image: prior predictive regression lines]

Viewed this way, it's clear that the model is biased towards models with relatively flat slopes. We can also look at the y-axis scale and decide if it makes sense given domain knowledge about the object being studied. I don't know what your data represents, but if we were talking about human height, for example, I would be unhappy with the fact that I'm generating models with a mean of plus or minus 20,000 with high probability.
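To put a rough number on that scale, here is a quick Monte Carlo sketch (not from the original post): drawing intercepts and slopes from the stated priors and evaluating the implied mean at a typical date of ~1180 shows where the ±20,000 comes from:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Draw from the priors used in the model above.
intercept = rng.normal(0, 20, n)
slope = rng.normal(0, 20, n)

# Prior-implied mean height at a typical date (~1180): the slope term
# gets multiplied by ~1180, so its spread dominates completely.
mu = intercept + slope * 1180
print(mu.std())  # on the order of 20,000+
```

This is the same pathology seen in the plot: on the raw Date scale, even a modest prior on the slope implies wildly implausible heights.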


Thank you for this very informative example, which will surely help me later with other decisions.
In fact, "Date" is the date of construction and "Height" is the height of cathedrals in a certain French department… But that doesn't change the problem; negative heights are nonsense, as you perfectly expressed. I am simply not very practiced yet at these analyses.
