Question about how to force y_pred to positive numbers and advice seeking on my exercises

Hello,

I am learning Bayesian Analysis with Python 2nd edition. And I did an exercise in linear model regression by using a tool dataset (please see the attached .csv –
insurance.csv (53.0 KB)
). I tried several models on this dataset.

I printed my jupyter notebook with the exercise models I tried and my self-notes of my thoughts during the exercise, the weird results and errors I got and my questions –
Insurance_Linear_Model_PZ-20230513.pdf (3.2 MB)

Could you please help me point out the mistakes I made during my exercise, and maybe also help me answer why I get weird results and sampling errors in some models?

And I want to ask how can I force my y_pred to positive numbers? Any examples I can learn from?

Thank you very much for helping me with this. Greatly appreciate it :blush:

Whilst I’ve not read Bayesian Analysis with Python yet, I’m trying to get to grips with some of the same questions you are. So in lieu of a reply from someone more qualified (hopefully Osvaldo himself!) here are some thoughts I had while looking through the code. Hopefully someone will correct any errors I make and we can both learn something!

General observations

  • The code and model look reasonable to me. You should consider updating to PyMC V5 though, if you can. There are loads of improvements and PyMC3 is getting on a bit now!
  • Defining model coords and using pm.Model(coords=coords) is really useful and has nice side effects such as friendlier names in diagnostic plots.
  • I like to keep my data in pandas when adding it to the model. Being able to reference columns by name, rather than index, makes it easier for me to remember which feature is which. I think it’s also easier for newcomers to the code to be able to understand it.

Simple / arbitrary model

  • I’m not sure why you’ve used the minus sign in a_tmp - pm.math.dot(x_mean, beta). I assume it’s deliberate but noting just in case not.
  • It is possible that the discrepancy at higher values of the posterior predictive distribution are being affected by outliers, depending on what you consider to be an outlier. However, the charges variable is very heavily right skewed, even after conditioning on age and BMI (at least from my quick checks). An easy first step would be to use a Student T distribution for the likelihood for this specific issue, which I see you tried later. One issue with that approach is it puts more mass on negative values of charges.
  • Indeed, negative charges are a consequence of using a likelihood distribution with support over negative values. Now, I’m not sure what the data is about or what you’re trying to achieve but you could ‘justify’ this by interpreting these negative charges as indicating that these cases are highly unworthy of being charged at all – so much so that in a hypothetical world they would be paid for being so health! You could back this up by claiming that the normal distribution is appropriate, because the mean of the prediction is comprised of the sum of many small contributions from the samples. Basically going down the route of accepting the model doesn’t represent the generative process in reality but it makes mathematical sense and is (or could be) useful anyway. An alternative would be to force the mean of the likelihood, mu, to always be positive. The most common way of doing this is by taking the exponent of the mean, which again I see you’ve already tried. It does makes interpretability and setting priors a little bit harder. More importantly, it only forces the mean of the likelihood to be positive, so negative values are still allowed (and common) because the charges are represented by an entire distribution. To ensure the posterior is entirely positive, you would need to either use a likelihood distribution with support only over positive values or discard negative samples from the posterior. I have seen the latter recommended but I’m not at all sure that’s a principled solution. You could combined the robust regression and truncated likelihood with a HalfStudentT likelihood – the charges approach zero so the lower bound wouldn’t be far off. (Edit: I see you also tried a StudentT.) Alternatively there’s the TruncatedNormal. I’m not sure how well that would work in terms of sampling or how hard it would be to argue for as a principled model design. I think it would be worth a try though, at least for purely prediction purposes.

Confounding
Again, not sure I can offer much on this, especially as I’m not familiar with the data, but I’ll offer some thoughts anyway! Assuming charges are related to health status, age and BMI are likely to be correlated. Metabolisms slow down as we get older so it’s plausible that BMI increases with age (starting to decline above a certain age). If charges can be considered a proxy for health outcomes, then clearly health declines with both increased BMI and increased age. When age is excluded, the model assigns the effect of age to BMI – BMI becomes the only parameter available to account for the variance in the data. If, in fact, age is the heavily dominant effect, when age and BMI are included, BMI has reduced ability to explain the variance in charges, because it was mostly acting as a proxy for age. So BMI’s effect drops whilst age retains a similar strength of effect as to when it was the only explanatory variable – in reality it was doing the work all along, just via BMI when that was the only route available.

Divergences
HMC uses the gradient of the posterior to inform the next step size for each sample. It’s a little bit similar to gradient decent optimisers with momentum. Sometimes the geometry of the posterior can be difficult for NUTS to estimate a good step size (high curvature) and it ends up in a location where the state of the sample (position and energy) don’t make sense. I suspect these are cause by a combination of the StudentT (it helps with the right skewed data but also has more mass at negative values, for which there’s no data) and the wider priors, which result in a) more negative posterior means (again where there’s no data) and b) positive values which are greater than the observed target.

2 Likes

Thank you so so much for this very comprehensive and interesting reply :rose: :rose: :rose:! And sure, I have had pymc5 installed, and I’ll start with redoing my exercise with pymc5 later this week. Then I’ll try HMC as the sampler to see if I can have something. I’ll share the findings of my trials. Tons of thanks again :sunny:!