Could Someone Give Me Advice on Handling Missing Data in Bayesian Modeling with PyMC?

Hello there, :wave:

I am new to Bayesian modeling and have been experimenting with PyMC for a few months now. First of all, I would like to thank everyone here for the wealth of knowledge available on this forum; it has been invaluable in getting me started!

I am working on a dataset with a significant amount of missing values, and I am unsure how to approach the problem in the context of Bayesian modeling. The dataset involves a mix of continuous and categorical variables, and the missing data appears to follow some non-random patterns.

I have read that PyMC allows for the modeling of missing data directly as part of the Bayesian inference process. Can anyone share examples or resources that show how this is implemented effectively, particularly for MNAR scenarios? :thinking:

Would you recommend using PyMC to model the missing data directly or preprocessing the data with imputation techniques like multiple imputation before using it in the Bayesian model?

Also, I have gone through this post: https://discourse.pymc.io/t/dealing-with-missing-data-and-custom-distribution-salesforce-commerce-cloud which definitely helped me out a lot.

Are there any common pitfalls to avoid when handling missing data in PyMC? For example, I have noticed that including too many predictors for imputation can sometimes slow down convergence.

Thanks in advance for your help and assistance. :innocent:


Hi roberttt,

I saw your post hasn’t been answered for a while, so I thought I’d try to give you some thoughts! I think your question is very general, and I’m not sure there is a very good specific answer (but please someone jump in if they feel otherwise). I guess my (perhaps not very helpful) feelings are that you can use PyMC to implement pretty much any missing data procedure. As you already pointed out, one thing you could do is to perform full Bayesian inference on your missing values, estimating a posterior distribution for them. Or you can do a two-step procedure like multiple imputation, which you also mentioned. I don’t think there’s necessarily “one procedure” that’s always better; it probably depends on how big your dataset is and how you’d like to model the missingness mechanism.

It’s been a while since I’ve worked with tricky missing data problems myself, but I do remember that the book by Little and Rubin (Statistical Analysis with Missing Data) was both helpful and had quite a few Bayesian examples, which I’m sure you could implement in PyMC. I think there is also a good chapter in Bayesian Data Analysis on how to handle missing data.

Sorry that I don’t have a more specific answer than this, but I hope at least that this points you in a useful direction.
