Disabling missing data imputation

I think the imputation feature in PyMC3 is fantastic and really streamlines some workflows. However, I have noticed in some applications that it leads to slower NUTS sampling. For example, if I construct a regression model with N fully observed data points, it will sample faster than the same model with N - M data points and M imputed variables.

My question is this: is there a straightforward way to disable the imputation? Alternatively, would it be sufficient to simply prune the added missing-variable FreeRVs from the model object after instantiation?

I think the best way is just to remove the missing values from your observed variable.

Ah, that would be the common-sense solution, but I’ve got a bit of an edge case. My observed data is a relatively large array, and removing the missing entries would turn it into a ragged array. That makes my sampling much less efficient, because I end up instantiating a new random variable for each row of the ragged array. In TensorFlow Probability I’ve accommodated this by simply applying an elementwise mask to zero out terms in the target log posterior density before sampling.
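Concretely, the TFP pattern I mean is roughly this (the names and shapes here are just illustrative, not my actual model):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def masked_log_lik(y_hat, y, sigma, mask):
    # elementwise log density; masked-out terms are zeroed before summing
    per_point = tfd.Normal(loc=y_hat, scale=sigma).log_prob(y)
    return tf.reduce_sum(per_point * mask)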

I see. This is in general difficult to deal with. Either you flatten your prediction and observed, or use pm.Potential with a mask, like what you did in TFP.

I forgot about pm.Potential. That should work nicely. Thanks a ton!

For future reference, here is an example of using a binary mask to ignore some values in the calculation of the posterior density:

import numpy as np
import pymc3 as pm

n = 30
p = 3

fraction_kept = 0.75

beta_true = np.random.randn(p)
X = np.random.randn(n,p)
y = np.dot(X,beta_true) + np.random.randn(n)

mask = np.random.binomial(1, fraction_kept, y.shape)  # 1 = keep this observation, 0 = ignore it

with pm.Model() as model:
    beta   = pm.Normal('beta',shape=p)
    err_sd = pm.HalfCauchy('err_sd',beta=1)
    y_hat  = pm.math.dot(X,beta)
    
    likelihood = pm.Potential('likelihood',pm.Normal.dist(mu=y_hat, sd=err_sd).logp(y)*mask)
    trace = pm.sample()

Thanks very much for the solution (I have the exact same problem with a large multidimensional array that benefits from keeping its full shape). Why is this different from the model below:

with pm.Model() as model:
    beta   = pm.Normal('beta',shape=p)
    err_sd = pm.HalfCauchy('err_sd',beta=1)
    y_hat  = pm.math.dot(X,beta)

    y_misfit = (y_hat - y) * (1-mask)
    # or alternatively:
    # y_misfit = pytensor.tensor.set_subtensor(y_misfit[mask], 0)

    pm.Normal("likelihood", mu=y_misfit, sigma=err_sd, observed=np.zeros(y.shape))
    trace = pm.sample()

The example above does not really work. Why? I think it is related to the fact that this potential:

likelihood = pm.Potential('likelihood',pm.logp(pm.Normal.dist(mu=y_hat-y, sigma=err_sd), 0)*mask)

works equivalently, as expected. But

likelihood = pm.Potential('likelihood',pm.logp(pm.Normal.dist(mu=(y_hat-y)*mask, sigma=err_sd), 0))

does not. I thought this would have resulted in something similar, since the resulting logp is the same up to a constant offset, right? But that offset seems to matter.

Anyway, thanks for the clever solution.

It looks like the second case is setting the masked-out residuals equal to zero but still contributing their logps. The masking should always be done on the probabilities (or log-probs), as opposed to the actual random value variables.
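Concretely, each of those zeroed-out residuals still contributes the Normal log-density evaluated at its mean, which depends on sigma, so the difference between the two formulations is not a constant offset. A quick check (toy numbers, just to illustrate):

import numpy as np
import pymc as pm

sigma = 2.0
# log-density that a zeroed-out residual still contributes in the second formulation
still_contributed = pm.logp(pm.Normal.dist(0.0, sigma), 0.0).eval()
print(still_contributed, -np.log(sigma * np.sqrt(2 * np.pi)))  # equal; varies with sigma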

You can also just try pm.logp(rv[mask], value[mask]) inside a Potential.

Using a boolean matrix for the selection, I receive the following error:

NotImplementedError: Logprob method not implemented for AdvancedSubtensor

I’m using

 pm.Potential('likelihood',pm.logp(pm.Normal(mu, sigma)[mask.values], obs_dataset[mask.values]))

where mask is an xr.DataArray of booleans.

I don’t know what you tried exactly. This works for me:

import pymc as pm
import numpy as np
print(pm.__version__)  # 5.9.0

obs_dataset = np.random.normal(size=(100,))
mask = obs_dataset > 0

with pm.Model() as m:
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")
    # We need the value variables of mu and sigma
    mu_value, sigma_value = m.replace_rvs_by_values([mu, sigma])
    pm.Potential(
        'likelihood',
        pm.logp(
            pm.Normal.dist(mu_value, sigma_value, shape=(100,))[mask],
            obs_dataset[mask]
        ),
    )

m.point_logps()  # {'mu': -0.92, 'sigma': -0.73, 'likelihood': -61.98}

Sounds more complicated than just filtering out the nans.

import pytensor.tensor as pt

with pm.Model() as m:
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")

    bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
    bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
    pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask], observed=obs_dataset[mask])

m.point_logps()  # {'mu': -0.92, 'sigma': -0.73, 'likelihood': -61.98}

For completeness, I think this is what @ckrapu was suggesting above:

with pm.Model() as m:
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")
    pm.Potential(
        'likelihood',
        pm.logp(pm.Normal.dist(mu, sigma), obs_dataset)[mask]
    )

m.point_logps()  # {'mu': -0.92, 'sigma': -0.73, 'likelihood': -61.98}

Hey @ricardoV94, thanks for the quick response. For some reason I tried again a few times and it worked; no idea what the error was. Thanks for adding a more recent example. I think the old rv.logp() is no longer working.

@ricardoV94 a follow-up question: the mask approach works fine when the dataset is fixed; however, when using MutableData it is not possible to use boolean indexing:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I have some MutableData and some coords_mutable. How can I make a mutable mask? Is there a way to set up the model so it can be re-trained on other datasets?

What exactly did you try?

Here is an example of what I want to do:

import pytensor.tensor as pt
import numpy as np
import pymc as pm
obs_dataset = np.random.rand(5, 6)
mask = (obs_dataset > 0.5)  # boolean mask over the observed array

# this works
with pm.Model() as m:

    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")

    bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
    bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
    pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask], observed=obs_dataset[mask])

m.point_logps()

# this doesn't work: how to do this when obs_dataset is mutable?
with pm.Model() as m:
    obs_dataset = pm.MutableData('obs_dataset', obs_dataset)
    mask = (obs_dataset > 0.5)  # symbolic mask, since obs_dataset is now a graph variable
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")

    bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
    bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
    pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask], observed=obs_dataset[mask])

m.point_logps()  

Basically the mask is mutable because obs_dataset is a MutableData, so how can I do the indexing in a mutable way?

So taking a step back. What is the reason you want to use MutableData for the model? Speed-wise PyMC will always recompile everything when you sample after setting new data, so you won’t get any savings compared to defining a new model every time.
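If recompilation is acceptable anyway, one option is to wrap the model construction in a function and rebuild it for each dataset, computing the mask as a plain numpy array outside the graph. A minimal sketch (build_model, folds, and the NaN-based masking rule are just placeholders for your setup):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

def build_model(obs_dataset):
    # mask is an ordinary numpy boolean array, computed before graph construction
    mask = ~np.isnan(obs_dataset)
    with pm.Model() as m:
        mu = pm.Normal("mu")
        sigma = pm.HalfNormal("sigma")
        bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
        bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
        pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask],
                  observed=obs_dataset[mask])
    return m

# one model per CV fold; PyMC recompiles at pm.sample either way
for fold_data in folds:
    with build_model(fold_data):
        trace = pm.sample()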

Hmm, interesting.
The reason is basically that I’m computing CV prediction scores, with multiple predictions using different `obs_dataset`s.

To be more precise, my problem is a little more complicated. The shapes of the variables mu and sigma are determined by some mutable coords (there is a time coordinate that I use for time-series prediction). When doing cross-validation those coordinates change, and therefore the shapes of mu and sigma change accordingly.
These variables, mu and sigma, are used in other parts of the model where the change in shape is needed. In this part, however, the "likelihood" variable, which is a formula of the original mu, is only needed when fitting, because the observed dataset is used just once; after that I don't actually need this value for prediction. But because the mask is linked to the X variable (with mutable coords), it makes a mess: you can't index mu[mask] if mask is a symbolic variable.

Does this make sense?
(Thanks in advance for helping me!)