Hi alphamaximus,
So my dataset is 20-dimensional, with roughly 1.3 million data points. Unfortunately it has a lot of missing values: of the 20 dimensions, the best has 11% missing and the worst 85%. But even at 85% missingness there are still roughly 195,000 entries in that column, which means those entries may still be valuable for inference.
So my approach to handling the missingness is pretty simple, maybe a bit naive too: I placed a 20-dimensional Bernoulli layer on top of the unobserved “true” value of Y, so that for each dimension of Y, if the Bernoulli variable is 1 we observe the true value, and if it’s 0 we treat it as missing.
Because my dataset has so much missingness, I thought using PyMC3’s built-in missing value mechanism might be unwise, so instead I replaced all missing values with a -9999 placeholder.
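For reference, the placeholder step is just something like this (the file name is illustrative, not my actual pipeline):

```python
import pandas as pd

# Load the raw data; "data.csv" stands in for the real source.
df = pd.read_csv("data.csv")

# Replace every NaN with the sentinel so PyMC3 never sees a masked
# array and never builds an imputation node per missing cell.
PLACEHOLDER = -9999.0
y_data = df.fillna(PLACEHOLDER).values  # shape (n_rows, 20)
```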
Also, ADVI does not support inference on discrete latents, so I had to explicitly marginalize out the Bernoulli switches in the code. This results in a two-component mixture with mixing proportions determined by p, the parameter of the Bernoulli switch.
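Concretely, the marginalized per-dimension density is (writing $\delta_{-9999}$ for the point mass at the placeholder and $f_\theta$ for the density of the “true” Y):

$$
f(y_d) = (1 - p_d)\,\delta_{-9999}(y_d) + p_d\,f_\theta(y_d)
$$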
In words: with probability 1 - p we get a Dirac spike at -9999 (that is, we don’t observe the “true” Y), and with probability p we observe it.
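In PyMC3, a stripped-down sketch of that marginalization looks roughly like this. Just to keep the example self-contained I’m assuming independent Gaussians for the “true” Y (the real model has its own likelihood), and names like `y_data` and `MISSING` are illustrative:

```python
import pymc3 as pm
import theano.tensor as tt

MISSING = -9999.0
obs_mask = (y_data != MISSING)  # y_data: (n, 20) array from the step above

with pm.Model() as model:
    # Per-dimension probability that the Bernoulli switch is 1 (observed).
    p = pm.Beta("p", alpha=1.0, beta=1.0, shape=20)

    # Stand-in "true" model: independent Gaussians per dimension.
    mu = pm.Normal("mu", mu=0.0, sigma=10.0, shape=20)
    sigma = pm.HalfNormal("sigma", sigma=5.0, shape=20)

    # Marginalized switch: an observed cell contributes log(p) + Normal logp,
    # a placeholder cell contributes log(1 - p) (the Dirac spike carries all
    # its mass at the placeholder, so it adds nothing beyond the mixing weight).
    normal_logp = pm.Normal.dist(mu=mu, sigma=sigma).logp(y_data)
    cell_logp = tt.switch(obs_mask, tt.log(p) + normal_logp, tt.log1p(-p))
    pm.Potential("loglike", cell_logp.sum())

    approx = pm.fit(n=20000, method="advi")
```

With ~1.3 million rows you’d probably want to wrap y_data in pm.Minibatch rather than feed the full array, but I left that out for clarity.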
Hope this clarifies.
Actually, looking back, maybe I should have used a supervised learning approach instead.