I tried to take inspiration from the example given [HERE: Linear regression] to analyze a regression with a small set of personal data.
The regression line resulting from this analysis does not correspond at all to what one might expect…
I wonder what is wrong in the written program, which does not look really different from the program given in the example? Unless mandatory specifics were implied (but not precisely written) in the given example?
I think this is a scale problem. Your Date variable takes values that are quite large, in the thousands. So if you think about drawing a line back from your data points (around 30) to the x-intercept, it will take a while to get there. What I’m trying to say is that the intercept is likely to be a large negative number. Computing the OLS coefficients can be instructive:
So the intercept is -300, which according to your prior has a probability very close to zero (0.0008 i believe). Since you’re ruling out intercepts that would allow slopes consistent with your data, the model compromises by abandoning the slope and increasing the uncertainty around a flat intercept.
You can solve this by one of two ways. First, you could crank up the sigma on the intercept prior so that it allows values like -300, or even switch it to something fat-tailed, like a Cauchy. This works fine, but a more general solution (and the recommended one by pretty much everyone) is to scale your inputs. Instead of fitting height ~ a + b * date, fit height ~ a + b * z_date, where z_date = (df.Date - df.Date.mean()) / df.Date.std() . By normalizing your inputs to have zero mean and unit variance, it makes it much easier to reason about priors. It also helps the sampler cruise along in many cases. See this discussion, for example.
By the way, this is where a prior predictive check would have been helpful! Looking at what your priors imply about your modeling assumptions relative to the data you are fitting can reveal problems like this.
Viewed this way, it’s clear that the model is biased towards models with relatively flat slopes. We can also look at the y-axis scale and decide if it makes sense given domain knowledge about the object being studied. I don’t know what your data represents, but if we were talking about human height for example, I would be unhappy with that fact that I’m generating models with a mean of plus or minus 20,000 with high probability.
Thank you for this very informative example, which will surely help me well, later, in other decisions to be made.
In fact, “Date” is date of construction and “Height” is a height of cathedrals, in a certain French department… But that doesn’t change the problem; negative pitches are nonsense, as you perfectly expressed. It is I who am not very efficient in my attempts at analysis.