Masking missing values of predictors

nico · July 9, 2020, 5:35pm

Dear PyMC3 Community,

I am looking for someone who has worked with missing data in both the predictors (Xi) and y_obs.
My understanding is that there is no need to impute beforehand, e.g. as part of a data preprocessing pipeline, because the imputation can be written directly into the Bayesian model.
I would like to model a Bernoulli classification based on X1 and X2, both of which contain missing values.
If I exclude the missing values, the model runs, but if I keep them, I run into errors.

I would highly appreciate any advice on this.

Please find the script below.

Thank you very much in advance

import numpy as np
import pymc3 as pm

# x_train and y_train are NumPy arrays loaded elsewhere
# y_train.shape is (97, 1)
# x_train.shape is (97, 2)
x_missing = np.isnan(x_train)  # boolean mask, True where a value is missing
X_train = np.ma.masked_array(x_train, mask=x_missing)
X_shape = len(x_missing)  # number of rows (97)

with pm.Model() as model:
    # Define priors
    beta = pm.Normal('beta', 0, 10)

    # Imputation of X missing values
    Xmu = pm.Normal('Xmu', 0, 1, shape=X_shape)
    X_modeled = pm.Normal('X', mu=Xmu, sd=10, observed=X_train)

    # Linear predictor
    lp = pm.Deterministic('lp', pm.math.dot(X_modeled, beta))

    # Likelihood
    y_obs = pm.Bernoulli('y_obs', p=lp, observed=y_train)

    # Inference
    trace = pm.sample()

junpenglao · July 10, 2020, 1:13pm

You will need to make sure Xmu is broadcastable to X_train, for example:

Xmu = pm.Normal('Xmu', 0, 1, shape=(X_shape, 1))

or

Xmu = pm.Normal('Xmu', 0, 1, shape=(X_shape, 2))
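
To see why the original shape=X_shape fails while the two shapes above work, the broadcasting rule can be checked with plain NumPy. This is only a sketch: the sizes 97 and 2 are taken from the comments in the original post, and the helper broadcasts_to_data is a hypothetical name introduced for illustration.

```python
import numpy as np

n_obs, n_pred = 97, 2  # shapes quoted in the original post


def broadcasts_to_data(shape):
    """Return True if an array of `shape` broadcasts to (n_obs, n_pred)."""
    try:
        np.broadcast_to(np.empty(shape), (n_obs, n_pred))
        return True
    except ValueError:
        return False


# (97,) is what shape=X_shape produces; the others are the suggested fixes
for shape in [(n_obs,), (n_obs, 1), (n_obs, n_pred)]:
    print(shape, broadcasts_to_data(shape))
```

NumPy aligns shapes from the trailing dimension, so (97,) is compared against the last axis of (97, 2) and does not match, whereas (97, 1) gives one mean per row and (97, 2) gives one mean per entry.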

nico · July 10, 2020, 2:02pm

Hi Junpeng,

Thank you very much for your message.

I also suspected that the issue might be related to the shape, but I still get a bad initial energy error when sampling the model.

The X matrix contains X1 and X2, each with a different number of missing values. Do you think I'm missing something in this regard?

Also, even though it might sound stupid, should x_missing contain the actual missing values, or the boolean NumPy mask marking which entries are to be treated as latent variables?

junpenglao · July 10, 2020, 2:46pm

The generated x_missing will contain the fill-in values for the masked latent variable. As for the bad initial energy problem, you can search this Discourse for suggested solutions.
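
Both points can be explored with plain NumPy. The sketch below uses a hypothetical toy array (x_toy) and hypothetical coefficients (beta_toy) rather than the poster's data. It shows that the mask passed to np.ma.masked_array is a boolean array, that each column can have its own number of masked entries, and one possible cause of bad initial energy that is not confirmed in this thread: an unbounded linear predictor passed directly as p to a Bernoulli, which can fall outside [0, 1].

```python
import numpy as np

# Hypothetical toy predictors; NaN marks a missing value
x_toy = np.array([[0.5, np.nan],
                  [np.nan, 2.0],
                  [3.0, -1.0]])

# The mask is boolean: True where the value is missing
mask = np.isnan(x_toy)
x_masked = np.ma.masked_array(x_toy, mask=mask)
n_missing_per_column = mask.sum(axis=0)
print(mask.dtype, n_missing_per_column)

# Hypothetical coefficients, one per predictor
beta_toy = np.array([1.5, -2.0])

# Fill masked entries with 0.0 purely to evaluate the linear predictor
lp = x_masked.filled(0.0) @ beta_toy
print(lp)  # not restricted to [0, 1], so not a valid probability

# The logistic (sigmoid) function maps any real number into (0, 1)
p = 1.0 / (1.0 + np.exp(-lp))
print(p)
```

In PyMC3, the corresponding change would be to wrap the linear predictor in pm.math.sigmoid, or to pass it as logit_p instead of p to pm.Bernoulli. Whether that resolves the error in this particular model is an assumption, not something established in the thread.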
