Logistic Regression w/ Missing Data?

Thanks for the help!

The matrix is a combination of continuous variables and categorical (nominal) variables. The nominal variables have been encoded as dummy variables (using the pandas get_dummies function). For now I’m filling in all the missing categorical data, since imputing missing values for only the continuous columns seemed like the lighter lift.

Example X, with columns and rows trimmed:

masked_array(data =
[
 [-- 0.23072215467974413 0.0] 
 [-0.2882092825160753 -- 0.0] 
 [-- -- 0.0]
 [-0.11676674468027119 -- 1.0]
 [-0.04329137132206948 -- 0.0]
 [-0.2882092825160753 -- 0.0]
 [1.1812981846479598 -1.2323976365182923 1.0]
 [-- -- 1.0]
 [0.20162653987193635 -- 0.0]
 [-- -0.13980818205222614 1.0]
],
             mask =
 [[ True False False]
 [False  True False]
 [ True  True False]
 [False  True False]
 [False  True False]
 [False  True False]
 [False False False]
 [ True  True False]
 [False  True False]
 [ True False False]],
       fill_value = -9999.0)
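
For context, a masked array like this can be built from a matrix that uses NaN for missing entries. This is a minimal sketch, assuming NaN marks the missing values. The small `raw` matrix is made up for illustration and is not my real data:

```python
import numpy as np

# Hypothetical 3x3 slice; NaN marks a missing entry (an assumption for this sketch).
raw = np.array([
    [np.nan, 0.23, 0.0],
    [-0.29, np.nan, 0.0],
    [1.18, -1.23, 1.0],
])

# masked_invalid masks every NaN/inf entry and leaves the rest as observed values.
X_masked_matrix = np.ma.masked_invalid(raw)

# The boolean mask is True exactly where a value is missing,
# so summing it per column counts the missing entries.
missing_per_column = X_masked_matrix.mask.sum(axis=0)
```

The mask has the same shape as the data, which is why the printed output shows a True/False matrix next to the values.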

I think I’m understanding a bit better now. So my model should have some version of:

X_mu = some_prior(shape=D)
X_sigma = some_prior(shape=D)
X = pm.Normal('X', mu=X_mu, sd=X_sigma, observed=X_masked_matrix)

And maybe a Bernoulli on the categorical columns if I want to impute those? Some of my variables have more than two categories, and I’m not sure how to add the constraint that the dummy variables for a category such as color should sum to 1 on each row.
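
One way I could imagine handling the sum-to-one issue is to impute a single category index per row rather than each dummy column separately, then expand that index back into dummies. Below is a rough sketch of the idea in plain Python. The helper names (`dummies_to_index`, `index_to_dummies`) and the three-category color data are hypothetical, and I'm assuming a row with any missing dummy counts as a missing category:

```python
# Hypothetical dummy columns for one nominal variable,
# e.g. color_red, color_green, color_blue.
# None marks a missing entry (an assumption for this sketch).
rows = [
    [1, 0, 0],
    [None, None, None],
    [0, 0, 1],
]

def dummies_to_index(row):
    """Collapse one row of dummies to a category index, or None if missing."""
    if any(v is None for v in row):
        return None
    return row.index(1)

def index_to_dummies(index, n_categories):
    """Expand a category index into a one-hot row; it sums to 1 by construction."""
    return [1 if k == index else 0 for k in range(n_categories)]

codes = [dummies_to_index(r) for r in rows]

# Stand-in for an imputed value: in a real model this would presumably be drawn
# from a categorical distribution; here it is hard-coded for illustration.
imputed = [c if c is not None else 1 for c in codes]

completed = [index_to_dummies(c, 3) for c in imputed]
```

Because each row is rebuilt from a single index, the dummies can never disagree with each other, so no explicit constraint is needed.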

Am I on the right track here?