Logistic Regression w/ Missing Data?

I’d like to add more to this discussion.

I am dealing with a problem similar to @capdevc's. The trick X = pm.Normal('X', mu=X_mu, sd=X_sigma, observed=X_masked_matrix) works well when you pass a numpy masked array directly to the observed parameter. For regression, however, we would like to be able to swap out X for new values later, which is usually done with theano.shared. The trouble is that shared converts a masked array into an ordinary tensor variable, so the mask is lost and the values that sat under the mask are passed on to PyMC3 as if they were real observations. I tried to imitate the relevant code in pymc3/model.py to work around this, but the approach there is to create a fake distribution for the masked values whose shape depends on the number of missing values, so it cannot easily be made flexible for a shared variable with indefinite dimensions.
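
To make the mask-loss concrete, here is a minimal numpy-only sketch. It assumes (I have not traced every code path) that theano.shared ends up copying the input into a plain ndarray, much like np.array does; the toy data and the -999.0 placeholder are made up for illustration.

```python
import numpy as np

# Toy design matrix; -999.0 marks a missing entry (illustrative placeholder).
raw = np.array([[1.0, -999.0],
                [3.0, 4.0]])
X_masked = np.ma.masked_array(raw, mask=(raw == -999.0))

# Converting to a plain ndarray (assumed to be roughly what happens inside
# theano.shared) silently drops the mask.
X_plain = np.array(X_masked)

print(type(X_masked).__name__, type(X_plain).__name__)
print(hasattr(X_plain, "mask"))  # the mask attribute is gone
print(X_plain[0, 1])             # the placeholder now looks like real data
```

Once the array has been through that conversion, nothing downstream can tell which entries were missing.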

Relevant code in pymc3/model.py (line 1172):

def as_tensor(data, name, model, distribution):
    dtype = distribution.dtype
    data = pandas_to_array(data).astype(dtype)

    if hasattr(data, 'mask'):
        from .distributions import NoDistribution
        testval = np.broadcast_to(distribution.default(), data.shape)[data.mask]
        fakedist = NoDistribution.dist(shape=data.mask.sum(), dtype=dtype,
                                       testval=testval, parent_dist=distribution)
        missing_values = FreeRV(name=name + '_missing', distribution=fakedist,
                                model=model)
        constant = tt.as_tensor_variable(data.filled())

        dataTensor = tt.set_subtensor(
            constant[data.mask.nonzero()], missing_values)
        dataTensor.missing_values = missing_values
        return dataTensor
    elif sps.issparse(data):
        data = sparse.basic.as_sparse(data, name=name)
        data.missing_values = None
        return data
    else:
        data = tt.as_tensor_variable(data, name=name)
        data.missing_values = None
        return data
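
As I read it, the masked branch above does the following, sketched here in plain numpy rather than theano. The missing_values vector is only a stand-in for the FreeRV, and the toy data is made up. The point is that its length is fixed by data.mask.sum() at model-build time.

```python
import numpy as np

# Toy masked data with two missing entries (illustrative only).
data = np.ma.masked_array(
    [[1.0, 0.0, 3.0],
     [0.0, 5.0, 6.0]],
    mask=[[0, 1, 0],
          [1, 0, 0]],
)

# Stand-in for the '<name>_missing' FreeRV: one entry per masked cell.
n_missing = int(data.mask.sum())
missing_values = np.full(n_missing, 0.5)

# numpy analogue of tt.set_subtensor(constant[mask.nonzero()], missing_values)
constant = data.filled()
filled = constant.copy()
filled[data.mask.nonzero()] = missing_values

print(n_missing)
print(filled)
```

If new data with a different number of masked cells were swapped in, missing_values would need a different length, which is why this construction does not carry over directly to a shared variable.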

I do not know how to best address this problem. Any suggestions?