I was wondering, is there any best practice for extrapolating from a regression line vs interpolation?
What I’ve basically done is to scale up sigma the further I get from the last observed value on the line. My reasoning is that interpolated values should generally be more precise, while extrapolated values are less precise.
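Roughly like this, as a sketch of the idea (the linear inflation and the 0.1 factor are just placeholders I picked, not anything principled):

```python
import numpy as np

x_obs_max = 10.0   # last observed x on the fitted line (made up here)
sigma_fit = 0.5    # residual sigma estimated by the regression (made up here)

def predictive_sigma(x_new):
    # inside the observed range: use the fitted sigma as-is;
    # beyond it: inflate sigma in proportion to how far past the data we are
    excess = np.maximum(0.0, np.asarray(x_new) - x_obs_max)
    return sigma_fit * (1.0 + 0.1 * excess)
```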
What is the form of your regression? In many Bayesian contexts, both intercepts and slopes (coefficients) will be uncertain. In such models, extrapolation will automatically be uncertain without having to build it in explicitly. McElreath refers to this as the “bow tie” pattern.
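For example, in an ordinary Bayesian linear regression the interval around the fitted line fans out as you move away from the data, with no extra machinery. Here is a minimal PyMC sketch of that effect (the data, priors, and prediction grid are made up purely for illustration):

```python
import numpy as np
import pymc as pm

# made-up data with a roughly linear trend
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=30)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, size=30)

# prediction grid extending well past the observed range
x_new = np.linspace(-5, 25, 100)

with pm.Model():
    a = pm.Normal("a", 0, 10)
    b = pm.Normal("b", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("obs", a + b * x, sigma, observed=y)
    idata = pm.sample()

# posterior draws of the regression line evaluated on the grid
post = idata.posterior
mu_new = post["a"].values[..., None] + post["b"].values[..., None] * x_new

# the interval for the line is narrowest near the bulk of the data and
# fans out (the "bow tie") as x_new moves away from it
lower, upper = np.quantile(mu_new, [0.03, 0.97], axis=(0, 1))
print((upper - lower).round(2))
```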
Another thing to consider is that when you extrapolate, you may or may not have confidence that your model ( e.g. the model form ) still applies. Consider @cluhmann 's example, but suppose the data go nonlinear outside of the measurements. You would be severely underestimating the uncertainty and delivering an answer with higher confidence than you should, even with the bow tie effect. Only you can know, for your problem of interest, whether your model will be appropriate for extrapolation and how far you would trust it outside of the dataset.
Interesting. Am I right in thinking that what you’re alluding to is that while my data suggests a linear relationship, if I were to collect the data corresponding to the extrapolated region it might actually look like something else instead? E.g. that data might look more like an exponential curve, whereas my initial data suggested it was linear.
That’s exactly what I mean. This is something that a GP handles ( in some regards ) via the covariance function, but you still select which covariance function you use ( e.g. Gaussian vs. exponential ), and that choice has implications for how rapidly, and to what extent, your uncertainty grows.
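As a rough illustration of that last point, here is a small scikit-learn sketch ( toy data, arbitrary length scales ) comparing a squared-exponential ( "Gaussian" ) covariance with an exponential one ( a Matern kernel with nu = 1/2 ). The predictive standard deviation stays small near the observations and grows as you leave the data, at a rate and to a ceiling set by the covariance function:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

# toy data: roughly linear inside the observed range
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=20)[:, None]
y = 1.0 + 0.5 * x.ravel() + rng.normal(0, 0.2, size=20)

# prediction grid extending well past the data
x_new = np.linspace(-5, 25, 200)[:, None]

kernels = {
    "Gaussian (squared exponential)": RBF(length_scale=2.0),
    "Exponential (Matern nu=1/2)": Matern(length_scale=2.0, nu=0.5),
}

for name, kernel in kernels.items():
    gp = GaussianProcessRegressor(kernel=kernel, alpha=0.04, normalize_y=True)
    gp.fit(x, y)
    mean, std = gp.predict(x_new, return_std=True)
    # std is small near the observations and grows away from them;
    # how fast it grows, and where it plateaus, depends on the kernel
    print(name, "std near data:", std[np.argmin(np.abs(x_new.ravel() - 5))].round(2),
          "std far from data:", std[-1].round(2))
```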
In my field ( material modeling ), the models that I construct are informed by physical laws which must be obeyed, so, at the very least, that helps me build models which are forced to remain within the laws of physics. But if I extrapolate too far, or even interpolate between very widely spaced points, there is no guarantee that I’ll be predictive. If you are trying to make inferences about things that don’t obey physical laws ( or at least you aren’t enforcing them in your modeling framework ), and instead are trying to intuit trends or assume some form that seems appropriate given the data you observed, you have to be very careful.
As a more concrete ( though kind of silly ) example:
I build a model for steel and I calibrate that model using data that I collect at room temperature. I’m probably okay being predictive in that regime, but I definitely won’t be predictive on the surface of the sun!
Another silly example:
I have a perfect model that captures the behavior of steam and ice. There is no guarantee that I will be accurate in the prediction of water.
You might be able to hedge your bets somewhat by doing something like Bayesian Model Averaging or Bayesian Model Combination, both of which are subsets of ensemble learning. I would at least consider something like that if you are looking to extrapolate far from the data or interpolate between very widely spaced clusters of points.
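If it helps, here is a rough sketch of one practical version of that idea: fit two candidate model forms in PyMC and get model weights from ArviZ. Note that this uses LOO-based stacking / pseudo-BMA weights rather than BMA proper ( which would weight models by their marginal likelihoods ), and the data and model forms are made up for illustration:

```python
import numpy as np
import pymc as pm
import arviz as az

# made-up data that is ambiguous between a linear and a quadratic trend
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=40)
y = 1.0 + 0.5 * x + 0.02 * x**2 + rng.normal(0, 0.5, size=40)

def fit_polynomial(order):
    with pm.Model():
        coefs = pm.Normal("coefs", 0, 10, shape=order + 1)
        mu = sum(coefs[k] * x**k for k in range(order + 1))
        sigma = pm.HalfNormal("sigma", 1)
        pm.Normal("obs", mu, sigma, observed=y)
        # pointwise log-likelihood is needed for LOO-based weights
        return pm.sample(idata_kwargs={"log_likelihood": True})

idatas = {"linear": fit_polynomial(1), "quadratic": fit_polynomial(2)}

# model weights from estimated out-of-sample fit (stacking / pseudo-BMA)
cmp = az.compare(idatas, ic="loo", method="stacking")
print(cmp[["elpd_loo", "weight"]])
# predictions from the two model forms can then be mixed using these weights
```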