What does the hierarchical model look like when having missing in observed?



I wonder what would be the hierarchical model given having missing in observed? I am assuming pymc3 will automatically assign a prior (not sure what it looks like) to missing values but I can not work out the full model. Would the likelihood be p(non-missing | missing, \theta)? However, in the model I am supposed to set it to be p(non-missing, missing | \theta). Please advise.


Basically you get p(non-missing | \theta) and p(missing | \theta_posterior) = p(missing | \theta, non-missing). The later case is added to the model as a free RV (so you actually see it in the trace, think of it as more similar to a posterior sample).


Thanks for prompt response. I am not sure if I am following. As a concrete example, say \mu ~ N(0,1), observed Y | \mu ~ N(\mu, 1). Inference would be done based on p(\mu | Y) while Y has no missing.

If Y has missing (denote missing part as Y_m and non-missing as Y_nm. Would pymc3 do p(\mu | Y_nm) and p(Y_m | Y_nm)? And how to understand the likelihood (Y_m, Y_nm) | \mu ~ N(\mu, 1)? Don’t we need Y_nm | Y_m, \mu instead as likelihood?


It is the same when missing value is presented. Internally, PyMC3 isolate the Y_m and Y_nm, and do p(\mu | Y_nm). The Y_m is internally added to the model as a free random variable, and Y_m ~ Y | \mu, Y_nm. In the above example, Y_m | \mu_posterior ~ N(\mu_posterior, 1) where , \mu_posterior ~ p(\mu | Y_nm).


Thanks for the sharing the detail! I understand now how it works when Y is 1-dimensional. The problem I am working on having Y as multi-dimensional (think of Y as multi-variate normal) and for each row of Y, it could be missing for just some columns or the entire row. So would pymc3 use all the non-missing-at-all rows to infer \mu then sample missing Y’s or incorporate partially missing rows as well? If so, could you comment on how it works briefly under this framework? Would there be any conditional distributions estimated as well in case which I would image it’d be quite computationally pricy?


In multivariate case this won’t work, currently the missing value feature in pymc3 only handles 1d case. You can have a look at some related discussion here: OOS predictions with missing input values


Will take a look. Interesting story is it works for multivariate case based on a simulation that I did, which makes me wonder how it works in the first place. It was a simple set up where data is from multivariate normal with zero mean and cov matrix 3x3 (manually set as long as it’s pd). I masked each column 30% nan randomly and turned out it not only recovered the cov matrix pretty well (close to true), but delivered better imputation than using col means. The likelihood I used is:
out = pm.MvNormal(‘out’, mu=0, chol=chol, observed=masked_z)
where chol has LKJ prior and I was using ADVI to do inference. Any idea?


You can check the shape of the trace, my guess is that any row containing missing value is excluded. So in effect the parameters are estimated using rows with no missing data.


I checked the shape of trace for missing values and it matched with #missing in data.

To figure out which rows the algo was using, I did another simulation where instead of random missing, I purposely set each row having one missing value. Turned out still the algo was able to recover the cov matrix and at the same time generate dist. for missing values which was better than trivial imputation. In other words I think the algo is smart to use partially missing rows to do inference but I am not sure how it did it. Couldn’t figure out a generative process from it.

Another observation is the corr estimation was better for pair of columns where co-non-missing rate is high, which makes me wonder did the algo use pairwise marginal distribution to do inference instead of the joint?


Interesting - I am surprised that it works so well. Could you please share your code and (stimulation) data?


Hi Junpeng, sorry for the late response. I have some difficulties sharing my code and hope it would work for me to describe what I did.

The model itself is following document: https://docs.pymc.io/notebooks/LKJ.html (same hyper-parameters) assuming data is from multivariate normal distribution with dims = (10000,3), where my mu = [1,10,-5], cov=[[1,0.1,0.2],[0.1,1,0.9],[0.2,0.9,1]]. Data was then masked with np.nan each column with missing rate = 0.3. I used ADVI as inference. It only took minutes to get the trace and interesting things started happening from here - not only well recovered cholesky matrix but give good predictions on missing


Oh you are running ADVI - in that case I am not sure how the missing value is handle but in generally I think only the unmasked it considered. I will have another look and get back to you.


Thanks! One more comment that I also did another simulation to address your point here that I set missing rate = 0.4 and made sure each row has exactly one missing value (so no non-missing rows), which also works