Thanks for the help!
The matrix is a combination of continuous variables and categorical (nominal) variables. The nominal variables have been encoded as dummy variables (using the pandas get_dummies function). For now I’ve filled in all the missing categorical data up front, since imputing missing values only for the continuous columns seemed like the lighter lift.
Example X, with columns and rows trimmed:
masked_array(data =
[
[-- 0.23072215467974413 0.0]
[-0.2882092825160753 -- 0.0]
[-- -- 0.0]
[-0.11676674468027119 -- 1.0]
[-0.04329137132206948 -- 0.0]
[-0.2882092825160753 -- 0.0]
[1.1812981846479598 -1.2323976365182923 1.0]
[-- -- 1.0]
[0.20162653987193635 -- 0.0]
[-- -0.13980818205222614 1.0]
],
mask =
[[ True False False]
[False True False]
[ True True False]
[False True False]
[False True False]
[False True False]
[False False False]
[ True True False]
[False True False]
[ True False False]],
fill_value = -9999.0)
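For anyone wanting to reproduce an input of this shape, here is a minimal sketch of one way such a masked array could be built from a DataFrame with NaNs. The frame, its column names (`cont_a`, `cont_b`, `color`) and the use of `drop_first=True` are all made up for illustration and are not taken from my actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two continuous columns with gaps, one nominal column.
df = pd.DataFrame({
    "cont_a": [np.nan, -0.288, np.nan, -0.117],
    "cont_b": [0.231, np.nan, np.nan, np.nan],
    "color": ["red", "blue", "red", "blue"],
})

# Dummy-encode the nominal column. With two levels and drop_first=True
# this leaves a single 0/1 column (color_red).
encoded = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=float)

# masked_invalid masks every NaN (or inf) entry and leaves the rest visible.
X_masked = np.ma.masked_invalid(encoded.to_numpy(dtype=float))

print(X_masked.mask)
```

The mask is True exactly where the original frame had NaN, so the fully observed dummy column stays unmasked.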
I think I’m understanding a bit better now. So my model should have some version of:
X_mu = some_prior(shape=D)
X_sigma = some_prior(shape=D)
X = pm.Normal('X', mu=X_mu, sd=X_sigma, observed=X_masked_matrix)
And maybe a Bernoulli on the categorical columns if I want to impute those? Some of my variables have more than two categories, and I’m not sure how to add the constraint that the dummy variables for one category (say, color) should sum to 1 on each row.
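One possible way around the sum-to-1 constraint, sketched below under assumptions: instead of one Bernoulli per dummy column, collapse each block of dummies back into a single integer-coded column with a mask for the missing rows. A single categorical variable per row (I believe something like `pm.Categorical` would be the candidate) picks exactly one level, so the constraint holds by construction. The dummy block here is hypothetical, and I haven't verified how well imputation of a masked discrete column samples in practice.

```python
import numpy as np

# Hypothetical dummy block for one 3-level variable
# (e.g. color_red, color_green, color_blue). An all-NaN row
# stands for an observation whose category is missing.
dummies = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [np.nan, np.nan, np.nan],
    [0.0, 1.0, 0.0],
])

# A row is missing if any of its dummy entries is NaN.
missing = np.isnan(dummies).any(axis=1)

# argmax recovers the level index of each one-hot row;
# missing rows get a placeholder 0 that the mask hides.
codes = np.where(missing, 0, np.nan_to_num(dummies).argmax(axis=1))
color_codes = np.ma.masked_array(codes, mask=missing)

print(color_codes)
```

The resulting masked integer column has one entry per row rather than one per level, so there is nothing left to constrain across columns.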
Am I on the right track here?