Disabling missing data imputation

I think the imputation feature in PyMC3 is fantastic and really streamlines some workflows. However, I have noticed in some applications that it leads to slower NUTS sampling. For example, if I construct a regression model with N fully observed data points, it will sample faster than the same model with N - M data points and M imputed variables.

My question is this: is there a straightforward way to disable the imputation? Alternatively, would it be sufficient to simply prune the added missing-value FreeRVs from the model object after instantiation?

I think the best way is just to remove the missing values from your observed variable.

Ah, that would be the common-sense solution, but I've got a bit of an edge case. My observed data is a relatively large array, and removing the missing entries would turn it into a ragged array. Instantiating a new random variable for each row of the ragged array appears to make sampling much less efficient. In TensorFlow Probability I've accommodated this by simply applying an elementwise mask that zeroes out the corresponding terms in the target log posterior density before sampling.


I see. This is difficult to deal with in general. Either you flatten your prediction and observed arrays so that only the observed entries remain, or you use pm.Potential with a mask, like what you did in TFP.
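To make the two options concrete, here is a minimal pure-Python sketch (no PyMC3 involved) showing that they give the same log likelihood under a normal observation model. The data, the predictions `mu`, and the helper `normal_logpdf` are all hypothetical and only for illustration; in a real model the flattened version would correspond to indexing both the prediction and the data at the observed positions before passing them to the likelihood.

```python
import math

def normal_logpdf(y, mu, sd):
    """Log density of Normal(mu, sd) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((y - mu) / sd) ** 2

# Hypothetical 2 x 3 data set; None marks a missing entry.
y = [[0.5, None, -1.0],
     [None, 2.0, 0.1]]
# Hypothetical model predictions with the same shape as y.
mu = [[0.0, 0.3, -0.5],
      [1.0, 1.5, 0.0]]
sd = 1.0

# Option 1: flatten. Keep only the observed entries and their predictions.
pairs = [(v, m) for row_y, row_mu in zip(y, mu)
         for v, m in zip(row_y, row_mu) if v is not None]
flat_y = [v for v, _ in pairs]
flat_mu = [m for _, m in pairs]
loglik_flat = sum(normal_logpdf(v, m, sd) for v, m in zip(flat_y, flat_mu))

# Option 2: mask. Keep the rectangular shape, fill missing entries with a
# finite placeholder, and multiply each term by 0 (missing) or 1 (observed).
mask = [[0.0 if v is None else 1.0 for v in row] for row in y]
y_filled = [[0.0 if v is None else v for v in row] for row in y]
loglik_masked = sum(
    normal_logpdf(v, m, sd) * k
    for row_y, row_mu, row_k in zip(y_filled, mu, mask)
    for v, m, k in zip(row_y, row_mu, row_k)
)

print(loglik_flat, loglik_masked)
```

The flattened version evaluates fewer terms, while the masked version keeps the array rectangular at the cost of computing terms that are then multiplied by zero.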


I forgot about pm.Potential. That should work nicely. Thanks a ton!

For future reference, here is an example of using a binary mask to ignore some values in the calculation of the posterior density:

import numpy as np
import pymc3 as pm

n = 30
p = 3

fraction_kept = 0.75

beta_true = np.random.randn(p)
X = np.random.randn(n,p)
y = np.dot(X,beta_true) + np.random.randn(n)

# Binary mask: 1 keeps an observation, 0 drops it from the log density
mask = np.random.binomial(1,fraction_kept,y.shape)

with pm.Model() as model:
    beta   = pm.Normal('beta',shape=p)
    err_sd = pm.HalfCauchy('err_sd',beta=1)
    y_hat  = pm.math.dot(X,beta)
    # Elementwise log density times the mask; entries with mask == 0 contribute nothing
    likelihood = pm.Potential('likelihood',pm.Normal.dist(mu=y_hat, sd=err_sd).logp(y)*mask)
    trace = pm.sample()
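
One caveat when applying this to real data, assuming the missing entries are stored as NaN: multiplying by a zero mask does not remove them, because NaN * 0 is still NaN in floating-point arithmetic. A minimal pure-Python sketch of the problem and of one possible workaround (filling the missing entries with an arbitrary finite placeholder before masking) follows; `normal_logpdf` is a hypothetical helper and the values are made up.

```python
import math

def normal_logpdf(y, mu, sd):
    """Log density of Normal(mu, sd) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((y - mu) / sd) ** 2

mu, sd = 0.0, 1.0
y = [0.3, float("nan"), -1.2]   # middle entry is missing
mask = [1.0, 0.0, 1.0]          # 1 = observed, 0 = missing

# Naive masking: the NaN term survives the multiplication by zero.
naive = sum(normal_logpdf(v, mu, sd) * k for v, k in zip(y, mask))

# Workaround: replace NaN with any finite value first; the mask then removes
# that term, so the choice of placeholder does not affect the result.
y_filled = [0.0 if math.isnan(v) else v for v in y]
safe = sum(normal_logpdf(v, mu, sd) * k for v, k in zip(y_filled, mask))

# Reference value computed from the observed entries only.
expected = normal_logpdf(0.3, mu, sd) + normal_logpdf(-1.2, mu, sd)

print(math.isnan(naive), math.isclose(safe, expected))
```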