I think the imputation feature in PyMC3 is fantastic and really streamlines some workflows. However, I have noticed in some applications that it leads to slower NUTS sampling. For example, if I construct a regression model with N fully observed data points, it will sample faster than the same model with N - M data points and M imputed variables.
My question is this: is there a straightforward way to disable the imputation? Alternatively, would it be sufficient to simply prune the added missing-variable FreeRVs from the model object after instantiation?
Ah, that would be the common sense solution but I’ve got a bit of an edge case. My observed data is a relatively large array and removing the missing entries would make it into a ragged array. This appears to make the sampling that I do much less efficient when I instantiate a new random variable for each row of the ragged array. In Tensorflow Probability I’ve accommodated this by simply applying an elementwise mask to zero out terms in the target log posterior density before sampling.
I see. This is in general difficult to deal with. Either you flatten your prediction and observed data, or use pm.Potential with a mask, like what you did in TFP.
Thanks very much for the solution (I have the exact same problem with a large multidimensional array that benefits from keeping its full shape). Why is this different from the model below:
does not. I thought this would have resulted in something similar, since the resulting logp is the same up to a constant offset, right? But that offset seems to matter.
It looks like the second case is setting all the residuals equal to zero but still contributing their logps. The masking should always be done on the probabilities (or log-probs), not on the actual random-value variables.
Hey @ricardoV94, thanks for the quick response. For some reason I tried a few times and it worked; no idea what the error was, but thanks for adding a more recent example. I think the old rv.logp() is no longer working.
@ricardoV94, a follow-up question: the mask approach works fine when the dataset is fixed; however, when using a MutableData it's not possible to use indexing:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I have some MutableData and some coords_mutable; how can I make the mask mutable? Is there a way to set up this model so it can be re-trained on other datasets?
```python
import pytensor.tensor as pt
import numpy as np
import pymc as pm

obs_dataset = np.random.rand(5, 6)
mask = obs_dataset > 1

# this works
with pm.Model() as m:
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")
    bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
    bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
    pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask], observed=obs_dataset[mask])

m.point_logps()  # {'mu': -0.92, 'sigma': -0.73, 'likelihood': -61.98}
```
```python
# this doesn't work -- how to do this when obs_dataset is mutable?
with pm.Model() as m:
    obs_dataset = pm.MutableData("obs_dataset", obs_dataset)
    mask = obs_dataset > 1
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")
    bcast_mu = pt.broadcast_to(mu, obs_dataset.shape)
    bcast_sigma = pt.broadcast_to(sigma, obs_dataset.shape)
    pm.Normal("likelihood", bcast_mu[mask], bcast_sigma[mask], observed=obs_dataset[mask])

m.point_logps()
```
Basically, the mask is mutable because obs_dataset is a MutableData, so how can I do the indexing in a mutable way?
So taking a step back. What is the reason you want to use MutableData for the model? Speed-wise PyMC will always recompile everything when you sample after setting new data, so you won’t get any savings compared to defining a new model every time.
mmmm interesting
So the reason is basically that I'm computing CV prediction scores, making multiple predictions using different `obs_dataset`s.
To be more precise, my problem is a little bit more complicated. The shapes of the variables mu and sigma are determined by some mutable coords (there is a time coordinate that I use for time-series prediction). When doing cross-validation those coordinates change, and therefore the shapes of mu and sigma change accordingly.
These variables, mu and sigma, are used in other parts of the model where the change in shape is needed. In this part, however, the variable "likelihood", which is a function of the original mu, is only needed when fitting, since the observed dataset is used just once; for the predictions I don't actually need this value. But because the mask is linked to the X variable (with mutable coords), it makes a mess: you can't index mu[mask] if mask is a symbolic variable.
Does this make sense?
(thanks in advance for helping me!! )