What does the hierarchical model look like when there are missing values in the observed data?

Hi,

I wonder what the hierarchical model looks like when there are missing values in the observed data. I am assuming PyMC3 will automatically assign a prior (not sure what it looks like) to the missing values, but I cannot work out the full model. Would the likelihood be p(non-missing | missing, \theta)? However, in the model I specify it as p(non-missing, missing | \theta). Please advise.

Basically you get p(non-missing | \theta) and p(missing | \theta_posterior) = p(missing | \theta, non-missing). The latter is added to the model as a free RV (so you actually see it in the trace; think of it as something closer to a posterior sample).

Thanks for the prompt response. I am not sure I am following. As a concrete example, say \mu ~ N(0,1) and the observed Y | \mu ~ N(\mu, 1). When Y has no missing values, inference is based on p(\mu | Y).

If Y has missing values (denote the missing part as Y_m and the non-missing part as Y_nm), would PyMC3 do p(\mu | Y_nm) and p(Y_m | Y_nm)? And how should I understand the likelihood (Y_m, Y_nm) | \mu ~ N(\mu, 1)? Don't we need Y_nm | Y_m, \mu as the likelihood instead?

It is the same when missing values are present. Internally, PyMC3 separates Y_m and Y_nm and does p(\mu | Y_nm). Y_m is added to the model as a free random variable, with Y_m ~ Y | \mu, Y_nm. In the above example, Y_m | \mu_posterior ~ N(\mu_posterior, 1), where \mu_posterior ~ p(\mu | Y_nm).
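For concreteness, here is a minimal sketch of that 1-D example (the observed variable is passed as a NumPy masked array; the auto-created imputation RV should show up with a '_missing' suffix, but double-check the name against your PyMC3 version):

```python
import numpy as np
import pymc3 as pm

# Toy data: Y | mu ~ N(mu, 1), with a few entries missing.
y = np.random.normal(loc=1.0, scale=1.0, size=100)
y[[3, 17, 42]] = np.nan
y_masked = np.ma.masked_invalid(y)

with pm.Model():
    mu = pm.Normal('mu', mu=0.0, sd=1.0)
    # Passing a masked array as observed makes PyMC3 split Y into Y_nm
    # (used in the likelihood for mu) and Y_m (added as a free RV).
    Y = pm.Normal('Y', mu=mu, sd=1.0, observed=y_masked)
    trace = pm.sample(1000, tune=1000)

# The imputed Y_m values appear in the trace like any other free RV:
print(trace['Y_missing'].shape)  # (total draws, number of missing entries)
```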

Thanks for sharing the detail! I now understand how it works when Y is 1-dimensional. In the problem I am working on, Y is multi-dimensional (think of Y as multivariate normal), and each row of Y can be missing in just some columns or in its entirety. So would PyMC3 use only the fully observed rows to infer \mu and then sample the missing Y's, or would it incorporate the partially missing rows as well? If so, could you briefly comment on how that works under this framework? Would conditional distributions be estimated as well in that case, which I would imagine could be quite computationally pricey?

In the multivariate case this won't work; currently the missing-value feature in PyMC3 only handles the 1-d case. You can have a look at some related discussion here: OOS predictions with missing input values

Will take a look. Interestingly, it does work in the multivariate case in a simulation I did, which makes me wonder how it works in the first place. It was a simple setup where the data come from a multivariate normal with zero mean and a 3x3 covariance matrix (set manually, only required to be positive definite). I masked 30% of each column with NaN at random, and it turned out the model not only recovered the covariance matrix pretty well (close to the truth) but also delivered better imputation than using column means. The likelihood I used is:
out = pm.MvNormal('out', mu=0, chol=chol, observed=masked_z)
where chol has an LKJ prior and I used ADVI for inference. Any idea?
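For reference, the masking step looked roughly like this (a sketch rather than my exact code; the covariance values here are just an example of a positive-definite matrix):

```python
import numpy as np

# Simulate from a 3-d multivariate normal and hide ~30% of each
# column at random; the masked array is what goes into observed=.
rng = np.random.RandomState(0)
true_cov = np.array([[1.0, 0.1, 0.2],
                     [0.1, 1.0, 0.9],
                     [0.2, 0.9, 1.0]])
z = rng.multivariate_normal(np.zeros(3), true_cov, size=10000)
z[rng.rand(*z.shape) < 0.3] = np.nan
masked_z = np.ma.masked_invalid(z)
```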

You can check the shape of the trace; my guess is that any row containing a missing value is excluded, so in effect the parameters are estimated using only the rows with no missing data.
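Something along these lines (assuming the MvNormal is named 'out' as above, in which case the auto-created imputation RV should be named 'out_missing'):

```python
# Compare the number of NaNs in the data with the number of imputed
# entries per draw; z and trace are from the model above.
import numpy as np
print(np.isnan(z).sum())           # total number of masked entries
print(trace['out_missing'].shape)  # (number of draws, number of imputed entries)
```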

I checked the shape of the trace for the missing values and it matched the number of missing entries in the data.

To figure out which rows the algorithm was using, I did another simulation where, instead of random missingness, I deliberately gave each row exactly one missing value. It turned out the algorithm was still able to recover the covariance matrix and, at the same time, produce distributions for the missing values that beat trivial imputation. In other words, I think the algorithm is smart enough to use the partially missing rows for inference, but I am not sure how it does it; I couldn't work out a generative process from it.
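The per-row masking was done roughly like this (again a sketch, reusing the same true_cov as before):

```python
import numpy as np

rng = np.random.RandomState(1)
z = rng.multivariate_normal(np.zeros(3), true_cov, size=10000)

# Mask one randomly chosen column in every row, so each row has
# exactly one missing value and no row is fully observed.
drop = rng.randint(0, z.shape[1], size=z.shape[0])
z[np.arange(z.shape[0]), drop] = np.nan
masked_z = np.ma.masked_invalid(z)
```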

Another observation is that the correlation estimates were better for pairs of columns with a high rate of jointly observed entries, which makes me wonder whether the algorithm uses pairwise marginal distributions for inference instead of the joint distribution.

Interesting - I am surprised that it works so well. Could you please share your code and (simulation) data?

Hi Junpeng, sorry for the late response. I have some difficulty sharing my code; I hope it works if I describe what I did instead.

The model follows this document: https://docs.pymc.io/notebooks/LKJ.html (same hyper-parameters), assuming the data come from a multivariate normal distribution with dims = (10000, 3), where my mu = [1, 10, -5] and cov = [[1, 0.1, 0.2], [0.1, 1, 0.9], [0.2, 0.9, 1]]. Each column of the data was then masked with np.nan at a missing rate of 0.3. I used ADVI for inference. It only took minutes to get the trace, and the interesting things started here - the model not only recovered the Cholesky matrix well but also gave good predictions for the missing values.
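Since I can't share the actual code, here is a rough reconstruction of the setup (following the linked notebook; names and hyper-parameters are approximate, not my exact script):

```python
import numpy as np
import pymc3 as pm

# Simulate the data described above and mask ~30% of each column.
rng = np.random.RandomState(42)
mu_true = np.array([1.0, 10.0, -5.0])
cov_true = np.array([[1.0, 0.1, 0.2],
                     [0.1, 1.0, 0.9],
                     [0.2, 0.9, 1.0]])
z = rng.multivariate_normal(mu_true, cov_true, size=10000)
z[rng.rand(*z.shape) < 0.3] = np.nan
masked_z = np.ma.masked_invalid(z)

with pm.Model():
    # LKJ prior on the Cholesky factor, as in the LKJ notebook.
    packed_chol = pm.LKJCholeskyCov('packed_chol', n=3, eta=2.0,
                                    sd_dist=pm.HalfCauchy.dist(2.5))
    chol = pm.expand_packed_triangular(3, packed_chol, lower=True)
    mu = pm.Normal('mu', 0.0, 10.0, shape=3)
    out = pm.MvNormal('out', mu=mu, chol=chol, observed=masked_z)

    # ADVI fit; the imputed values show up in the trace as 'out_missing'.
    approx = pm.fit(method='advi', n=30000)
    trace = approx.sample(1000)
```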

Oh, you are running ADVI - in that case I am not sure how the missing values are handled, but in general I think only the unmasked entries are considered. I will have another look and get back to you.

Thanks! One more comment: to address your point, I ran another simulation with missing rate = 0.4, making sure each row has exactly one missing value (so no fully observed rows), and that also works.