Disabling missing data imputation

I think the imputation feature in PyMC3 is fantastic and really streamlines some workflows. However, I have noticed in some applications that it leads to slower NUTS sampling. For example, if I construct a regression model with N fully observed data points, it will sample faster than the same model with N - M data points and M imputed variables.

My question is this: is there a straightforward way to disable the imputation? Alternatively, would it be sufficient to simply prune the added missing-variable FreeRVs from the model object after instantiation?

I think the best way is just to remove the missing values from your observed variable.

Ah, that would be the common-sense solution, but I’ve got a bit of an edge case. My observed data is a relatively large array, and removing the missing entries would turn it into a ragged array. That makes my sampling much less efficient, because I end up instantiating a new random variable for each row of the ragged array. In TensorFlow Probability I’ve accommodated this by simply applying an elementwise mask to zero out terms in the target log posterior density before sampling.
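Concretely, the TFP pattern I mean is roughly this (the names and shapes here are just illustrative, not my actual model):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def masked_log_lik(y_hat, y, sigma, mask):
    # elementwise log density; masked-out terms are zeroed before summing
    per_point = tfd.Normal(loc=y_hat, scale=sigma).log_prob(y)
    return tf.reduce_sum(per_point * mask)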

I see. This is in general difficult to deal with. Either you flatten your prediction and observed, or use pm.Potential with a mask, like what you did in TFP.

I forgot about pm.Potential. That should work nicely. Thanks a ton!

For future reference, here is an example of using a binary mask to ignore some values in the calculation of the posterior density:

import numpy as np
import pymc3 as pm

n = 30
p = 3

fraction_kept = 0.75

beta_true = np.random.randn(p)
X = np.random.randn(n,p)
y = np.dot(X,beta_true) + np.random.randn(n)

mask = np.random.binomial(1, fraction_kept, y.shape)  # 1 = keep this observation, 0 = ignore it

with pm.Model() as model:
    beta   = pm.Normal('beta',shape=p)
    err_sd = pm.HalfCauchy('err_sd',beta=1)
    y_hat  = pm.math.dot(X,beta)
    
    likelihood = pm.Potential('likelihood',pm.Normal.dist(mu=y_hat, sd=err_sd).logp(y)*mask)
    trace = pm.sample()

Thanks very much for the solution (I have the exact same problem with a large multidimensional array that benefits from keeping its full shape). Why is this different from the model below:

with pm.Model() as model:
    beta   = pm.Normal('beta',shape=p)
    err_sd = pm.HalfCauchy('err_sd',beta=1)
    y_hat  = pm.math.dot(X,beta)

    y_misfit = (y_hat - y) * (1-mask)
    # or alternatively:
    # y_misfit = pytensor.tensor.set_subtensor(y_misfit[mask], 0)

    pm.Normal("likelihood", mu=y_misfit, sigma=err_sd, observed=np.zeros(y.shape))
    trace = pm.sample()

The example above does not really work. Why? I think it is related to the fact that this potential:

likelihood = pm.Potential('likelihood',pm.logp(pm.Normal.dist(mu=y_hat-y, sigma=err_sd), 0)*mask)

works equivalently, as expected. But

likelihood = pm.Potential('likelihood',pm.logp(pm.Normal.dist(mu=(y_hat-y)*mask, sigma=err_sd), 0))

does not. I thought this would have resulted in something similar, since the resulting logp is the same up to a constant offset, right? But that offset seems to matter.

Anyway, thanks for the clever solution.

It looks like the second case is setting the masked-out residuals equal to zero but still contributing their logps. The masking should always be done on the probabilities (or log-probs), as opposed to the actual random value variables.
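Concretely, each of those zeroed-out residuals still contributes the Normal log-density evaluated at its mean, which depends on sigma, so the difference between the two formulations is not a constant offset. A quick check (toy numbers, just to illustrate):

import numpy as np
import pymc as pm

sigma = 2.0
# log-density that a zeroed-out residual still contributes in the second formulation
still_contributed = pm.logp(pm.Normal.dist(0.0, sigma), 0.0).eval()
print(still_contributed, -np.log(sigma * np.sqrt(2 * np.pi)))  # equal; varies with sigma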

You can also just try pm.logp(rv[mask], value[mask]) inside a Potential.

Using a boolean matrix for the selection, I receive the following error:

NotImplementedError: Logprob method not implemented for AdvancedSubtensor

I’m using

 pm.Potential('likelihood',pm.logp(pm.Normal(mu, sigma)[mask.values], obs_dataset[mask.values]))

where mask is an xr.DataArray of booleans.

I don’t know what you tried exactly. This works for me:

import pymc as pm
import numpy as np
print(pm.__version__)  # 5.9.0

obs_dataset = np.random.normal(size=(100,))
mask = obs_dataset > 0

with pm.Model() as m:
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")
    # We need the value variables of mu and sigma
    mu_value, sigma_value = m.replace_rvs_by_values([mu, sigma])
    pm.Potential(
        'likelihood',
        pm.logp(
            pm.Normal.dist(mu_value, sigma_value, shape=(100,))[mask],
            obs_dataset[mask]
        ),
    )

m.point_logps()  # {'mu': -0.92, 'sigma': -0.73, 'likelihood': -61.98}

Sounds more complicated than just filtering out the nans.

import pytensor.tensor as pt

with pm.Model() as m:
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")

    bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
    bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
    pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask], observed=obs_dataset[mask])

m.point_logps()  # {'mu': -0.92, 'sigma': -0.73, 'likelihood': -61.98}

For completeness, I think this is what @ckrapu was suggesting above:

with pm.Model() as m:
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")
    pm.Potential(
        'likelihood',
        pm.logp(pm.Normal.dist(mu, sigma), obs_dataset)[mask]
    )

m.point_logps()  # {'mu': -0.92, 'sigma': -0.73, 'likelihood': -61.98}

Hey @ricardoV94, thanks for the quick response. For some reason I tried again a few times and it worked; no idea what the error was. Thanks for adding a more recent example. I think the old rv.logp() is no longer working.

@ricardoV94 a follow-up question: the mask approach works fine when the dataset is fixed; however, when using MutableData it is not possible to use boolean indexing:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I have some MutableData and some coords_mutable. How can I make a mutable mask? Is there a way to set up the model so it can be re-trained on other datasets?

What exactly did you try?

Here is an example of what I want to do:

import pytensor.tensor as pt
import numpy as np
import pymc as pm
obs_dataset = np.random.rand(5, 6)
mask = (obs_dataset > 0.5)  # boolean mask over the observed array

# this works
with pm.Model() as m:

    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")

    bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
    bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
    pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask], observed=obs_dataset[mask])

m.point_logps()

# this doesn't work: how to do this when obs_dataset is mutable?
with pm.Model() as m:
    obs_dataset = pm.MutableData('obs_dataset', obs_dataset)
    mask = (obs_dataset > 0.5)  # symbolic mask, since obs_dataset is now a graph variable
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")

    bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
    bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
    pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask], observed=obs_dataset[mask])

m.point_logps()  

Basically the mask is mutable because obs_dataset is a MutableData, so how can I do the indexing in a mutable way?

So taking a step back. What is the reason you want to use MutableData for the model? Speed-wise PyMC will always recompile everything when you sample after setting new data, so you won’t get any savings compared to defining a new model every time.
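If recompilation is acceptable anyway, one option is to wrap the model construction in a function and rebuild it for each dataset, computing the mask as a plain numpy array outside the graph. A minimal sketch (build_model, folds, and the NaN-based masking rule are just placeholders for your setup):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

def build_model(obs_dataset):
    # mask is an ordinary numpy boolean array, computed before graph construction
    mask = ~np.isnan(obs_dataset)
    with pm.Model() as m:
        mu = pm.Normal("mu")
        sigma = pm.HalfNormal("sigma")
        bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
        bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
        pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask],
                  observed=obs_dataset[mask])
    return m

# one model per CV fold; PyMC recompiles at pm.sample either way
for fold_data in folds:
    with build_model(fold_data):
        trace = pm.sample()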

Hmm, interesting.
The reason is basically that I’m computing CV prediction scores, with multiple predictions using different `obs_dataset`s.

To be more precise, my problem is a little more complicated. The shapes of the variables mu and sigma are determined by some mutable coords (there is a time coordinate that I use for time-series prediction). When doing cross-validation those coordinates change, and therefore the shapes of mu and sigma change accordingly.
These variables, mu and sigma, are used in other parts of the model where the change in shape is needed. In this part, however, the "likelihood" variable, which is a formula of the original mu, is only needed when fitting, because the observed dataset is used just once; after that I don't actually need this value for prediction. But because the mask is linked to the X variable (with mutable coords), it makes a mess: you can't index mu[mask] if mask is a symbolic variable.

Does this make sense?
(Thanks in advance for helping me!)