I have the following model, where the feature matrix X is of shape N x K and some of the regressors contain missing values:
    import numpy as np
    import pymc as pm  # assumption: use `import pymc3 as pm` on older installs

    with pm.Model(coords=coords) as rnd_icpt_mdl:
        # Replace NaNs with a sentinel, then mask the sentinel entries
        X_imp = np.nan_to_num(x=X, nan=-999)
        masked_values = np.ma.masked_array(X_imp, mask=X_imp == -999)
        beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=n_vars)  # Slopes - same for all groups
        X = pm.Normal("X", mu=0.0, sigma=1.0, observed=masked_values)  # Imputed X values
        eps = pm.InverseGamma(name="eps", alpha=9.0, beta=4.0)  # Model error
        y_hat = pm.math.dot(X, beta)  # Model prediction: features times slopes
        y_like = pm.Normal("y_like", y_hat, sigma=eps, observed=data["DepVar"], dims="obs_id")  # Data likelihood
I impute the missing values using a masked array.
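To make the masking step concrete, here is a minimal sketch on toy data. The array `X_raw` and its values are made up for illustration and are not my real data. It also tries `np.ma.masked_invalid`, which, as far as I can tell, builds the same mask directly from the NaNs without going through the -999 sentinel.

```python
import numpy as np

# Toy feature matrix (hypothetical values) with two missing entries.
X_raw = np.array([
    [1.0, np.nan],
    [2.0, 0.5],
    [np.nan, 1.5],
])

# Sentinel-based masking, as in the model above.
X_imp = np.nan_to_num(x=X_raw, nan=-999)
sentinel_masked = np.ma.masked_array(X_imp, mask=X_imp == -999)

# Alternative: mask NaNs directly, with no sentinel value involved.
masked_values = np.ma.masked_invalid(X_raw)

# Both approaches should flag exactly the same cells as missing.
same_mask = bool(np.array_equal(sentinel_masked.mask, masked_values.mask))

n_missing = int(masked_values.mask.sum())
missing_positions = np.argwhere(masked_values.mask).tolist()

print(masked_values.shape)  # the masked array keeps the N x K shape
print(n_missing, missing_positions, same_mask)
```

The direct approach avoids accidentally masking a genuine observation that happens to equal -999.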
However, a few things are unclear to me:
- What is the impact of the distribution (here Normal) on my feature matrix? Are only the imputed/missing values drawn from a normal distribution, or does the whole feature matrix X become normally distributed? On what basis should I choose the distribution and its parameters?
- Do I have to pass a shape parameter to the imputation statement? If so, what should the shape be? Something like this?

      X = pm.Normal("X", mu=0.0, sigma=1.0, observed=masked_values, shape=K)  # Imputed X values

  Or is the shape equal to the shape of the feature matrix, i.e. shape=(N, K)?