I’d like to add more to this discussion.

I am dealing with a similar problem as @capdevc. The trick `X = pm.Normal('X', mu=X_mu, sd=X_sigma, observed=X_masked_matrix)` works well when you supply a numpy masked array directly to the `observed` parameter. However, when using this for regression we want to be able to swap out `X` for new values, so `theano.shared` is usually used. The problem is that `shared` converts a masked array into a plain tensor Variable, which treats the former fill values as actual data and passes them on to PyMC3. I tried to imitate the relevant code in pymc3/model.py to solve this, but its approach to masked arrays is to create a fake distribution for the missing values, whose shape depends on the number of missing entries and therefore cannot easily be made flexible for a shared variable with indefinite dimensions.
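To show the mask-loss issue without needing a Theano install, here is a numpy-only sketch. Converting a masked array to a plain ndarray (which is essentially what happens when you hand it to `theano.shared`) silently discards the mask, so former fill values look like real observations. The fill value `-999.0` is just an illustrative placeholder:

```python
import numpy as np

# A 2x2 data matrix with one missing entry marked by the mask.
X = np.ma.masked_array(data=[[1.0, 2.0], [3.0, -999.0]],
                       mask=[[False, False], [False, True]])

# Passing X directly to `observed=` lets PyMC3 see X.mask and impute
# the missing entry. But coercing it to a plain array first (as
# theano.shared effectively does) drops the mask:
X_plain = np.asarray(X)

print(hasattr(X, 'mask'))        # True
print(hasattr(X_plain, 'mask'))  # False
print(X_plain[1, 1])             # -999.0 -- fill value treated as real data
```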
The relevant code in pymc3/model.py (line 1172):
```python
def as_tensor(data, name, model, distribution):
    dtype = distribution.dtype
    data = pandas_to_array(data).astype(dtype)
    if hasattr(data, 'mask'):
        from .distributions import NoDistribution
        testval = np.broadcast_to(distribution.default(), data.shape)[data.mask]
        fakedist = NoDistribution.dist(shape=data.mask.sum(), dtype=dtype,
                                       testval=testval, parent_dist=distribution)
        missing_values = FreeRV(name=name + '_missing', distribution=fakedist,
                                model=model)
        constant = tt.as_tensor_variable(data.filled())
        dataTensor = tt.set_subtensor(
            constant[data.mask.nonzero()], missing_values)
        dataTensor.missing_values = missing_values
        return dataTensor
    elif sps.issparse(data):
        data = sparse.basic.as_sparse(data, name=name)
        data.missing_values = None
        return data
    else:
        data = tt.as_tensor_variable(data, name=name)
        data.missing_values = None
        return data
```
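The crux is `shape=data.mask.sum()`: one free random variable is created per masked entry, so the count is fixed when the model is built. A numpy analogue of the fill step makes this concrete (the array values here are made up for illustration; `np.ndarray.__setitem__` plays the role of `tt.set_subtensor`):

```python
import numpy as np

mask = np.array([[False, True], [True, False]])
filled = np.array([[1.0, 0.0], [0.0, 4.0]])  # data with fill values in masked slots

# One free value per masked entry -- this length (here 2) is baked in
# at model-build time, which is why a shared X whose number of missing
# entries can change cannot reuse the same fake-distribution FreeRV.
missing_values = np.array([10.0, 20.0])
assert missing_values.shape[0] == mask.sum()

result = filled.copy()
result[mask.nonzero()] = missing_values  # numpy analogue of tt.set_subtensor
print(result)  # [[ 1. 10.] [20.  4.]]
```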
I do not know how to best address this problem. Any suggestions?