How to modify variational inference PyMC codes to handle missing data

I am running the following GLM model with the outcome mean depend on a binary variable W, which is also a function of X. I utilize variational inference to estimate the model. When W is completely observed, the following codes work perfectly; However, I am working with the case where W has missing value, the PyMC VI codes no longer works by returning Infinity average loss. Theoretically, VI is able to handle missing value by treating those missing observations as local random variables/parameters, and take expectations of those r.v. in either ELBO evaluation or optimization steps. I notice that PyMC MCMC would automatically impute missing value from its distribution by default. I wonder how to write the VI algorithm using PyMC codes such that it can also handle those missing data issues without removing them.

def GLM(X, y, w, K, M):
	with pm.Model() as model:
		params_W = pm.Normal('bW', mu=0, sigma=1, shape=K)
		params_T = pm.Normal('bT', mu=0, sigma=1, shape=K)
		PARAM_T_sigma2 = pm.InverseGamma('sigma',alpha =1, beta= 1)
		Z = pm.math.dot(X, params_W)
		probW = probit_phi(Z) 
		W = pm.Bernoulli('W', p=probW, observed=w)
		Y = pm.Normal('y', mu=W * pm.math.dot(X, params_T), sigma = tt.sqrt(W * PARAM_T_1_sigma2) , shape =M, observed =y)
	return model

glm = GLM(X, Y, W, K, M)
with glm:
    glm_vi = pm.fit(method="advi")

I’d be curious to know if there’s a better way to handle this, but I have had good outcomes doing it “by hand”. That is, using the same random variables to parameterize two distributions: one with the observed data, and one with the missing data, then fusing them back together.

For your example, I might try:

# I assume w is a pandas series 
obs_idx = w.index[W.notna()]

n_missing = int(w.isna().sum()) #pymc is very strict about the dtype on size!

# Set up a likelihood using only the observed values...
W_obs = pm.Bernoulli('W_obs', p=probW, observed=w[obs_idx])

#...but sample some extra values to fill in the missing values
W_missing = pm.Bernoulli('W_missing', p=probW, size=n_missing)

# Then fuse everything together
W = at.zeros(w.shape[0])
W = at.set_subtensor(W[np.where(w.notna().values)], W_obs)
W = at.set_subtensor(W[np.where(w.isna().values)], W_missing)

If you do this with MCMC you’ll find that you need to use a composite step sampler, because the missing values (in this case) come from a discrete distribution, so that’s a bummer.

You mention just taking expectations w.r.t W and using those as the missing values. I believe that would correspond to W = at.set_subtensor(W[np.where(w.isna().values)], W_obs.mean().repeat(n_missing))? So you just fill the missing values with mean of the posterior likelihood, and dispense with the secondary distribution over the missing values (`W_missing’, in my example).

Again, very eager for feedback.

1 Like

Thanks for the proposed solution! It looks promising and I would try it!
If I do this with MCMC in PyMC, it seems that PyMC would automatically draw sample from the distribution to impute those missing values making it much easier to write the code.
The idea of taking expectations w.r.t W and using those as the missing values applies to VI, at least in theory (e.g., Blei, Kucukelbir, McAuliffe, 2017), when the variational distributions are all chosen as exponential family.

Yes, you are absolutely right that it’s handled automatically when you use MCMC. I think I ran into some bad initial energy problems that caused me to do this in an MCMC context, but to be honest I don’t fully remember.

“Marginalization” is something quite mysterious to me when going from theory to implementation, so take that idea with a grain of salt. If I’m pushing a pen it means solving some awful integral, but in a PPL it seems to just reduce to typing .mean()? It’s so easy that I reflexively assume I’m doing something wrong…

I just implemented your suggested solution!
It goes quite smooth except running into one error as the follow:
ParametrizationError: Discrete variables are not supported by VI: W_missing ~ Bernoulli

Do you have any thoughts on how to fix it? Thanks!

Looks like what it says, you can’t sample a discrete variable using ADVI.

Your options appear to be:

  1. Try implementing some kind of marginalization over the missing values, maybe by taking the mean of the observed values, or by waiting for someone more experienced to chime in with a better option.
  2. Try to approximate the missing values using a continuous distribution with support on [0, 1] and centered on the probability parameter probW, like a LogitNormal or a Beta. This gives you something like your “belief” that this missing value is a 1. Note that in this case you now have to estimate a scale parameter for your missing values, so I don’t know if the juice is worth the squeeze. If you must have 0-1 values for down-stream calculations, perhaps you could try using a threshold, like at.ge(W_missing_continuous, 0.5) * 1.0?