Hi all, hopefully someone can help with this problem I’m having…
I’m trying to run a simple logistic regression with missing data in the covariates. I have an NxD numpy masked array (call it X) with all my observations in it and a length N vector of outputs (call it y).
Here’s my model (cut down in an effort to be clear/get it working):
with pm.Model() as logistic_model:
    beta = pm.Normal('beta', 0.0, sd=10, shape=X.shape[1])
    # Here's my problem
    p = pm.math.invlogit(pm.math.dot(X, beta))
    y_obs = pm.Bernoulli('y_obs', p, observed=y)
    trace = pm.sample(1000)
I initially tried doing this using the GLM module and a patsy formula, but I was unable to get that working since patsy was mangling my masked array. I kept ending up with the mask value in the dmatrices.
The problem I’m running into is with the dot product and logistic function. As far as I can tell, I’m doing operations between masked numpy arrays and theano tensors, and I’m not sure how to do that. I’ve tried theano.tensor.dot and numpy.ma.dot, and neither works. Is this currently possible?
Thanks!
Cristian
Masked arrays won’t help here. PyMC3 has support for missing values in observed variables; in this case that would be y. To work with missing values in your predictor X, you have to specify a distribution for those values. What datatypes do you have in there (continuous or discrete)? Can you post the matrix, or a subset of it?
Thanks for the help!
The matrix is a combination of continuous variables and categorical (nominal) variables. The nominal variables have been encoded as dummy variables (using the pandas get_dummies function). For now I’m filling in all the missing categorical data, since getting missing values imputed for the continuous stuff seemed like a lighter lift.
Example X with columns and rows trimmed:
masked_array(data =
[
[-- 0.23072215467974413 0.0]
[-0.2882092825160753 -- 0.0]
[-- -- 0.0]
[-0.11676674468027119 -- 1.0]
[-0.04329137132206948 -- 0.0]
[-0.2882092825160753 -- 0.0]
[1.1812981846479598 -1.2323976365182923 1.0]
[-- -- 1.0]
[0.20162653987193635 -- 0.0]
[-- -0.13980818205222614 1.0]
],
mask =
[[ True False False]
[False True False]
[ True True False]
[False True False]
[False True False]
[False True False]
[False False False]
[ True True False]
[False True False]
[ True False False]],
fill_value = -9999.0)
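For reference, a masked array like this can be built from a pandas frame that uses NaN for the missing entries. This is only a hypothetical sketch of that preprocessing (the frame and column names are made up), not the exact code used above:

import numpy as np
import pandas as pd

# Hypothetical example data; 'x1', 'x2' and 'colour' are made-up names
df = pd.DataFrame({'x1': [np.nan, -0.29, 1.18],
                   'x2': [0.23, np.nan, -1.23],
                   'colour': ['red', 'blue', 'red']})
# Dummy-code the nominal column, then mask the NaN entries
encoded = pd.get_dummies(df, columns=['colour'], drop_first=True)
X_masked_matrix = np.ma.masked_invalid(encoded.values.astype(float))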
I think I’m understanding a bit better now. So my model should have some version of:
X_mu = some_prior(shape=D)
X_sigma = some_prior(shape=D)
X = pm.Normal('X', mu=X_mu, sd=X_sigma, observed=X_masked_matrix)
And maybe a Bernoulli on the categorical columns if I want to impute those? I have some variables with more than two categories, and I’m not sure how to add the “dummy variables for category color should sum to 1 on each row” constraint.
Am I on the right track here?
You’ve got the idea, yes.
For variables with more than 2 categories, you can use a Categorical with a Dirichlet prior.
You can pass transform=pm.distributions.transforms.sum_to_1 to restrict values to sum to 1. Dividing those values by the sum of the already known color values should get the total sum right.
tt.stack and tt.concatenate can be used to combine the values into a matrix, but in this particular case it might be easier (though possibly a bit slower) to iterate over the rows and create an individual pm.Normal for each observation. That would also make it easier to marginalize out discrete parameters in case the sampler can’t deal with those properly.
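Putting these suggestions together, here is one possible sketch of an imputation-plus-regression model. The variable names (X_cont_masked, color_masked) are hypothetical, and instead of the dummy-variable/sum_to_1 route it indexes a per-category coefficient, which sidesteps the sum-to-1 constraint entirely; treat it as a starting point rather than the exact model discussed here.

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Assumed (hypothetical) inputs:
#   X_cont_masked : NxD_c numpy masked array of continuous predictors
#   color_masked  : length-N masked integer array of category codes
#   y             : length-N binary outcome
D_c = X_cont_masked.shape[1]
K = int(color_masked.max()) + 1

with pm.Model() as impute_logistic:
    # Priors for the per-column mean and scale of the continuous predictors
    X_mu = pm.Normal('X_mu', mu=0.0, sd=10.0, shape=D_c)
    X_sigma = pm.HalfNormal('X_sigma', sd=10.0, shape=D_c)
    # Masked entries become latent 'X_cont_missing' variables
    X_cont = pm.Normal('X_cont', mu=X_mu, sd=X_sigma, observed=X_cont_masked)

    # Categorical column: Dirichlet prior over category probabilities,
    # missing codes become latent 'color_missing' variables
    color_p = pm.Dirichlet('color_p', a=np.ones(K))
    color = pm.Categorical('color', p=color_p, observed=color_masked)

    # Logistic regression: index a per-category offset instead of
    # hand-coding dummy columns
    intercept = pm.Normal('intercept', 0.0, sd=10.0)
    beta = pm.Normal('beta', 0.0, sd=10.0, shape=D_c)
    beta_color = pm.Normal('beta_color', 0.0, sd=10.0, shape=K)

    logit_p = intercept + tt.dot(X_cont, beta) + beta_color[color]
    y_obs = pm.Bernoulli('y_obs', p=pm.math.invlogit(logit_p), observed=y)

    # NUTS handles the continuous parameters; PyMC3 assigns a
    # Metropolis-type step to the discrete missing category values
    trace = pm.sample(1000)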
I’d like to add more to this discussion.
I am dealing with a similar problem as @capdevc. The trick X = pm.Normal('X', mu=X_mu, sd=X_sigma, observed=X_masked_matrix) works well when you directly supply a numpy masked array to the observed parameter. However, when using this for regression, we would like to be able to swap out the X variable for new values, so theano.shared is usually used. Unfortunately, shared converts a masked array to a normal tensor Variable, which simply treats the former fill values as actual values and passes them on to PyMC3. I tried to imitate the relevant code in pymc3/model.py to solve this problem, but it seems that the solution there for masked arrays is to create a fake distribution for the masked values, whose shape depends on the number of missing values and thus cannot easily be made flexible for a shared variable with indefinite dimensions.
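As a quick illustration of the issue (hypothetical values), the mask is discarded as soon as the data is wrapped for theano, so downstream code can no longer tell which entries were missing:

import numpy as np
import theano

X_masked = np.ma.masked_array([[1.0, 2.0], [3.0, 4.0]],
                               mask=[[False, True], [False, False]],
                               fill_value=-9999.0)
X_shared = theano.shared(np.asarray(X_masked))  # only the raw data buffer survives
print(hasattr(X_shared.get_value(), 'mask'))    # False: the mask is gone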
The relevant code in pymc3/model.py (line 1172):
def as_tensor(data, name, model, distribution):
    dtype = distribution.dtype
    data = pandas_to_array(data).astype(dtype)
    if hasattr(data, 'mask'):
        from .distributions import NoDistribution
        testval = np.broadcast_to(distribution.default(), data.shape)[data.mask]
        fakedist = NoDistribution.dist(shape=data.mask.sum(), dtype=dtype,
                                       testval=testval, parent_dist=distribution)
        missing_values = FreeRV(name=name + '_missing', distribution=fakedist,
                                model=model)
        constant = tt.as_tensor_variable(data.filled())
        dataTensor = tt.set_subtensor(
            constant[data.mask.nonzero()], missing_values)
        dataTensor.missing_values = missing_values
        return dataTensor
    elif sps.issparse(data):
        data = sparse.basic.as_sparse(data, name=name)
        data.missing_values = None
        return data
    else:
        data = tt.as_tensor_variable(data, name=name)
        data.missing_values = None
        return data
I do not know how to best address this problem. Any suggestions?
No, I don’t think you can do that with shared variables at the moment. The fact that we can’t change the shape of variables dynamically is a known limitation. There is nothing fundamentally difficult about it, but a lot of code in various places depends on this not happening, so changing it would be quite a bit of work, I believe.
You’ll have to redefine the model for each dataset (put the model definition in a function generate_model(data) or similar that returns the model). While this isn’t nice, it usually isn’t terrible either, because theano caches compiled functions, so at least the compilation happens only once.
Thanks for the reply. That does seem to be the only option for now. It does make reusing trained models for prediction harder, though. Suppose I have data (X1, y1), train a regression model on it, and then want to predict on data X2 using sample_ppc; generate_model(data) will create two different models that cannot share parameters / traces. One way to work around this is to pool everything together and draw samples from ([X1, X2], [y1, unk]), but that is not very elegant or flexible.