Logistic Regression w/ Missing Data?

Hi all, hopefully someone can help with this problem I’m having…

I’m trying to run a simple logistic regression with missing data in the covariates. I have an NxD numpy masked array (call it X) with all my observations in it and a length N vector of outputs (call it y).

Here’s my model (cut down in an effort to be clear/get it working):

with pm.Model() as logistic_model:
    beta = pm.Normal('beta', 0.0, sd=10, shape=X.shape[1])    
    # Here's my problem
    p = pm.math.invlogit(pm.math.dot(X, beta))    

    y_obs = pm.Bernoulli('y_obs', p, observed=y)

    trace = pm.sample(1000)

I initially tried doing this using the GLM module and a patsy formula, but I was unable to get that working because patsy was mangling my masked array — I kept ending up with the fill value in the dmatrices.

The problem I’m running into is with the dot product and logistic function. As far as I can tell, I’m doing operations between masked numpy arrays and theano tensors, and I’m not sure how to do that. I’ve tried theano.tensor.dot and numpy.ma.dot, and neither works. Is this currently possible?



Masked arrays won’t help here. PyMC3 has support for missing values in observed variables; in this case that would be y. To work with missing values in your predictor X you have to specify the distribution of those values. What datatypes do you have in there (continuous or discrete)? Can you post the matrix, or a subset of it?

Thanks for the help!

The matrix is a combination of continuous variables and categorical (nominal) variables. The nominal variables have been encoded as dummy variables (using the pandas get_dummies function). For now I’m filling in all the missing categorical data, since getting missing values imputed for the continuous stuff seemed like a lighter lift.
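To illustrate what that encoding does with missing entries (toy data, column name made up) — by default get_dummies leaves an all-zero row where the nominal value was missing:

```python
import pandas as pd

# Toy nominal column with one missing entry.
df = pd.DataFrame({'color': ['red', 'green', None, 'red']})
dummies = pd.get_dummies(df['color'])
# The row with the missing value becomes all zeros; no NaN indicator
# column is created unless dummy_na=True is passed.
```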

example X with columns and rows trimmed:

masked_array(data =
 [[-- 0.23072215467974413 0.0]
 [-0.2882092825160753 -- 0.0]
 [-- -- 0.0]
 [-0.11676674468027119 -- 1.0]
 [-0.04329137132206948 -- 0.0]
 [-0.2882092825160753 -- 0.0]
 [1.1812981846479598 -1.2323976365182923 1.0]
 [-- -- 1.0]
 [0.20162653987193635 -- 0.0]
 [-- -0.13980818205222614 1.0]],
             mask =
 [[ True False False]
 [False  True False]
 [ True  True False]
 [False  True False]
 [False  True False]
 [False  True False]
 [False False False]
 [ True  True False]
 [False  True False]
 [ True False False]],
       fill_value = -9999.0)

I think I’m understanding a bit better now. So my model should have some version of:

X_mu = some_prior(shape=D)
X_sigma = some_prior(shape=D)
X = pm.Normal('X', mu=X_mu, sd=X_sigma, observed=X_masked_matrix)

And maybe Bernoulli on the categorical columns if I want to impute those? I have some variables with more than two categories, and I’m not sure how to add the constraint that the dummy variables for, say, the color category should sum to 1 in each row.

Am I on the right track here?

You’ve got the idea, yes.
For variables with more than 2 categories, you can use a Categorical with a Dirichlet prior.
You can pass transform=pm.distributions.transforms.sum_to_1 to restrict values to sum to 1; dividing those values by the sum over the already-known color columns should get the total sum right.
tt.stack and tt.concatenate can be used to combine the values into a matrix, but in this particular case it might be easier (though possibly a bit slower) to iterate over the rows and create an individual pm.Normal for each observation. That would also make it easier to marginalize out the discrete parameters in case the sampler can’t deal with them properly.


Thanks so much,

That was super helpful!

I’d like to add more to this discussion.

I am dealing with a similar problem as @capdevc. The trick X = pm.Normal('X', mu=X_mu, sd=X_sigma, observed=X_masked_matrix) works well when you directly supply a numpy masked array to the observed parameter. However, when using this for regression, we would like to be able to swap out the X variable for new values, so theano.shared is usually used. The trouble is that shared converts a masked array to a normal tensor variable, which simply treats the former fill values as actual values and passes them on to PyMC3. I tried to imitate the relevant code in pymc3/model.py to solve this problem, but it seems that the solution there for masked arrays is to create a fake distribution for the masked values, whose shape depends on the number of missing values and thus cannot easily be made flexible for a shared variable with indefinite dimensions.

relevant code in pymc3/model.py:
line 1172

def as_tensor(data, name, model, distribution):
    dtype = distribution.dtype
    data = pandas_to_array(data).astype(dtype)

    if hasattr(data, 'mask'):
        from .distributions import NoDistribution
        testval = np.broadcast_to(distribution.default(), data.shape)[data.mask]
        fakedist = NoDistribution.dist(shape=data.mask.sum(), dtype=dtype,
                                       testval=testval, parent_dist=distribution)
        missing_values = FreeRV(name=name + '_missing', distribution=fakedist,
                                model=model)
        constant = tt.as_tensor_variable(data.filled())

        dataTensor = tt.set_subtensor(
            constant[data.mask.nonzero()], missing_values)
        dataTensor.missing_values = missing_values
        return dataTensor
    elif sps.issparse(data):
        data = sparse.basic.as_sparse(data, name=name)
        data.missing_values = None
        return data
    else:
        data = tt.as_tensor_variable(data, name=name)
        data.missing_values = None
        return data

I do not know how to best address this problem. Any suggestions?

No, I don’t think you can do that with shared variables at the moment. The fact that we can’t change the shape of variables dynamically is a known limitation. There is nothing fundamentally difficult about it, but a lot of code in various places depends on this not happening, so changing it would be quite a bit of work, I believe.
You’ll have to redefine the model for each dataset (put the model definition in a function generate_model(data) or so that returns the model). While this isn’t nice, it usually shouldn’t be terrible, because theano caches compiled functions, so at least the compilation happens only once.

Thanks for the reply. That does seem to be the only option for now, though it makes reusing trained models for prediction harder. Suppose I have data (X1, y1) and wish to train a regression model and then predict on data X2 using sample_ppc: generate_model(data) will create two different models that cannot share parameters/traces. One way to work around this is to pool everything together and draw samples from ([X1, X2], [y1, unknown]), but that is not very elegant or flexible.