Problem using PyMC to estimate distribution of residuals

Hello,

I ran into an interesting issue at work and got permission to post about it.

We were trying to model the distribution of residuals of an ML regression problem. We tried many approaches with different levels of success. One of them was to use PyMC.

I think that PyMC gave worse results than it could have. For this reason I made a reproducible example of the approach to showcase its shortcomings.

You can see more about this approach in this folder of this GitHub repository.
Note that this was my first time using PyMC.

I am curious whether anyone in the community finds this an interesting problem or would like to give feedback on the approach and implementation.

Hi @Iolaum,

I looked at the notebook and must say that the approach doesn’t really make sense to me:
First you build a sklearn model to predict \mathbb{R}^6 \to \mathbb{R}, so that’s pretty standard. From y_hat - y_obs you calculate residuals, which is perfectly standard too.
If you’re interested in the distribution of residuals, why don’t you do pyplot.hist(residuals) and be done with it?
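
Something like this minimal sketch is all I mean (the model and data are placeholders, since I don’t know your exact setup):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

# placeholder data standing in for the real 6-feature regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=500)

model = GradientBoostingRegressor().fit(X, y)
residuals = model.predict(X) - y  # y_hat - y_obs, as above

# the empirical distribution of the residuals
plt.hist(residuals, bins=40)
plt.xlabel("residual")
plt.ylabel("count")
plt.show()
```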

Your second notebook tries to build a linear model to predict the residuals of the sklearn model as a function of the features.
Are you maybe trying to extrapolate to predict the sklearn model’s residuals for data that is outside the range already covered by the available data?

Here are a few points that might be helpful:

  • In Bayesian inference we don’t do train/test splitting: always fit the PyMC model to all available data at the same time. The Bayesian model formulation already captures uncertainty, and it also extrapolates if that’s what you are interested in. To judge how good a model is, you let the model predict/generate new data (the posterior predictive) and compare this predictive distribution with the data. The residuals between the data and the predictive distribution tell you about lack-of-fit errors of the model (see the sketch after this list).
  • Think of the “sigma” parameter of the likelihood as the measurement error. In your model this is sigma = e^mu, so you assume that your positive residuals have exponentially large measurement errors, while negative residuals are extremely accurate? You probably forgot an at.abs(), but even then I would advise starting with a constant sigma, as in the sketch after this list.
  • If you’re trying to build a model that extrapolates the error of the sklearn model, so that you can predict with uncertainty how biased the sklearn model will be on new data, then what assumptions do you have about this extrapolation? Remember that this measure is the mu variable going into the likelihood. Is a linear model a good model for this?
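
To make the first two points concrete, here is a rough sketch of the workflow I have in mind (priors, names, and data are placeholders, not a drop-in solution):

```python
import numpy as np
import pymc as pm
import arviz as az

# placeholders: X are the features, resid the sklearn model's residuals
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
resid = rng.normal(scale=0.5, size=500)

with pm.Model() as model:
    intercept = pm.Normal("intercept", 0, 1)
    beta = pm.Normal("beta", 0, 1, shape=X.shape[1])
    mu = intercept + pm.math.dot(X, beta)

    # constant measurement error instead of sigma = exp(mu)
    sigma = pm.HalfNormal("sigma", 1.0)

    pm.Normal("resid_obs", mu=mu, sigma=sigma, observed=resid)

    idata = pm.sample()
    # posterior predictive check: generate new data and compare it to the observations
    pm.sample_posterior_predictive(idata, extend_inferencedata=True)

az.plot_ppc(idata)
```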

I hope this helps.


Hello,

Thank you for your reply, and apologies for taking so long to get back to you; I’ve been sick for the past week.

Regarding your questions:

Are you maybe trying to extrapolate to predict the sklearn model’s residuals for data that is outside the range already covered by the available data?

Yes, that is one of the goals. I know there are issues with generalization, but even limited generalization would help.

In Bayesian inference we don’t do train/test splitting: always fit the PyMC model to all available data at the same time. The Bayesian model formulation already captures uncertainty, and it also extrapolates if that’s what you are interested in. To judge how good a model is, you let the model predict/generate new data (the posterior predictive) and compare this predictive distribution with the data. The residuals between the data and the predictive distribution tell you about lack-of-fit errors of the model.

Hm, interesting observation. I’ll give this some thought. However, for our use case we assume that the “test” data come without targets, so we won’t be able to use them to train on residuals. We do, however, want to estimate what the distribution of residuals is likely to be at those test points.
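
If I understand the suggestion correctly, something like the sketch below would let us fit on all the labelled data and still generate the expected residual distribution at unlabelled test points (PyMC v5-style API; names, priors, and data are placeholders):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 6))
resid_train = rng.normal(scale=0.5, size=500)  # residuals where targets are known
X_test = rng.normal(size=(100, 6))             # targets are unavailable here

with pm.Model() as model:
    X = pm.MutableData("X", X_train)
    beta = pm.Normal("beta", 0, 1, shape=6)
    mu = pm.math.dot(X, beta)
    sigma = pm.HalfNormal("sigma", 1.0)
    # tying the shape to mu lets the likelihood resize when X is swapped out
    pm.Normal("resid_obs", mu=mu, sigma=sigma, observed=resid_train, shape=mu.shape)

    idata = pm.sample()

with model:
    pm.set_data({"X": X_test})
    # distribution of residuals the model expects at the unlabelled test points
    preds = pm.sample_posterior_predictive(idata, predictions=True)
```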

Think of the “sigma” parameter of the likelihood as the measurement error. In your model this is sigma = e^mu, so you assume that your positive residuals have exponentially large measurement errors, while negative residuals are extremely accurate? You probably forgot an at.abs(), but even then I would advise starting with a constant sigma.

Thank you for those suggestions, I’ll give them a try. The reason I added sigma = e^mu is that I was getting errors at initialization and during training (sometimes, not consistently). I wasn’t sure how to debug them, but I noticed that if I used a functional form under which sigma could not be estimated as negative, no matter the inputs, those errors went away. I tried many ways of keeping sigma from going negative, and the one in the notebook worked best. I knew there were drawbacks, but you clarified them better than I had realized; thanks.

P.S. I think I tried adding abs to sigma but ran into a problem. I’ll have to check my notes and get back to you on that.

If you’re trying to build a model that extrapolates the error of the sklearn model, so that you can predict with uncertainty how biased the sklearn model will be on new data, then what assumptions do you have about this extrapolation? Remember that this measure is the mu variable going into the likelihood. Is a linear model a good model for this?

  • The key assumption is that there is no concept drift.

  • Moreover, a linear model is likely not the best choice, but I am using it as a starting point. Given how the dataset is generated (as detailed in the first notebook), a second-order polynomial could be the best option, if the model managed to learn the true relationship between the target and the features. But when I tried a second-order model I got much worse results and much longer training times; a sketch of what I mean by a second-order model follows this list.
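
For concreteness, by a second-order model I mean something along these lines (an illustrative sketch with placeholder data, not the exact notebook code):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
resid = rng.normal(scale=0.5, size=500)  # placeholder residuals

with pm.Model() as model:
    b0 = pm.Normal("b0", 0, 1)
    b1 = pm.Normal("b1", 0, 1, shape=6)
    b2 = pm.Normal("b2", 0, 1, shape=6)  # one quadratic coefficient per feature
    mu = b0 + pm.math.dot(X, b1) + pm.math.dot(X**2, b2)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("resid_obs", mu=mu, sigma=sigma, observed=resid)
    idata = pm.sample()
```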

Overall this has been a learning exercise. I’ll try to see whether I can set up the dependence of sigma in a simpler way, without the exponent, but also without getting errors during training; one option I am considering is sketched below. Any suggestions on this front are welcome.
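
The idea in the sketch: give sigma its own linear predictor passed through a softplus, which stays positive everywhere but grows only linearly rather than exponentially. This is an assumption on my part; I have not yet checked whether it actually samples more stably:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
resid = rng.normal(scale=0.5, size=500)  # placeholder residuals

with pm.Model() as model:
    beta_mu = pm.Normal("beta_mu", 0, 1, shape=6)
    mu = pm.math.dot(X, beta_mu)

    # separate coefficients for the scale, so sigma no longer depends on mu
    s0 = pm.Normal("s0", 0, 1)
    beta_s = pm.Normal("beta_s", 0, 1, shape=6)
    # softplus link: log(1 + exp(x)) is positive for any x but avoids the
    # explosive growth of exp(x); tensor libraries ship a stabler version
    sigma = pm.Deterministic(
        "sigma", pm.math.log(1 + pm.math.exp(s0 + pm.math.dot(X, beta_s)))
    )

    pm.Normal("resid_obs", mu=mu, sigma=sigma, observed=resid)
    idata = pm.sample()
```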