OOS predictions with missing input values


I’m having trouble generating out-of-sample predictions for a simple linear model with missing covariate data.

I understand how to fit a model with missing covariate data using a numpy masked array:

import numpy as np
import pymc3 as pm

# generate data with missing values
X_train = np.random.randn(4,2)
y_train = np.random.randn(4)

mask = [[0, 0], [1, 0], [0, 1], [0, 0],]
X_train = np.ma.masked_array(X_train, mask=mask)

# fit model
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    X_modeled = pm.Normal('X', mu=0, sd=1, observed=X_train)

    mu = alpha + sum([beta[i] * X_modeled[:,i] for i in range(2)])
    sd = pm.HalfCauchy('sd', beta=10)
    y = pm.Normal('y', mu=mu, sd=sd, observed=y_train)
    step = pm.NUTS()
    trace = pm.sample(500, step=step)

I also understand how to generate out of sample predictions using theano shared variables when no data is missing:

import numpy as np
import pymc3 as pm
import theano

# generate train data
X_train = np.random.randn(4,2)
y_train = np.random.randn(4)
X_input = theano.shared(X_train)

# fit model
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    mu = alpha + sum([beta[i] * X_input[:,i] for i in range(2)])
    sd = pm.HalfCauchy('sd', beta=10)
    y = pm.Normal('y', mu=mu, sd=sd, observed=y_train)
    step = pm.NUTS()
    trace = pm.sample(500, step=step)

# generate predictions on test data with fit model
X_test = np.random.randn(4,2)
ppc = pm.sample_ppc(trace, model=model)

I can’t seem to do both at the same time. Is there any way to fit a linear model with missing data and subsequently generate out of sample predictions using that fitted model with new data (which may also contain missing values)?

Dealing with multivariate x, errors in both x and y, and posterior predictive for y

Not sure if it is the best way, but my solution would be the following:

1, build a model that allows having missing values in the covariates (i.e., the predictor X). This is similar to your model one. However, to have a more meaningful prediction for missing input, a more informative prior is needed. So instead of:

X_modeled = pm.Normal('X', mu=0, sd=1, observed=X_train)

doing below will allow you to make a better prediction when some inputs are missing:

Xmu = pm.Normal('Xmu', mu=0, sd=10, shape=(1,2))
X_modeled = pm.Normal('X', mu=Xmu, sd=1., observed=X_train)

2, after getting the posterior from the model, you can plug in some new nodes for OOS predictions:

with model:
    # careful of the shape here. As the shape broadcasting might not work automatically
    X_mask3 = pm.Normal('X_mask3', mu=Xmu, sd=1., shape=(n3, 1, 2))
    Xpred = tt.squeeze(X_mask3)*mask2 + x_new*(1-mask2)
    mu3 = alpha + tt.dot(Xpred, beta)
    y3 = pm.Normal('y3', mu=mu3, sd=sd, shape=(n3,))
    ppc3 = pm.sample_ppc(trace, vars=[y3], samples=200)

The new input being x_new, with the mask being mask2. What I am doing here is that the missing element in the predictor x_new (indicated by mask2) was replaced by the posterior prediction generated similarly using a Gaussian with mean Xmu.

I think this is an interesting question so I make a small notebook, hope you find it helpful.


Thanks so much for your help here and for sharing the notebook. I’ve implemented your approach - build two nearly identical models, differing primarily in the way X is defined, rather than a single model with a theano shared variable to replace the input data. It seems to do the trick though I’m not sure I fully understand why it works.


  1. In the initial model X is an observed variable. Why do you not need to define X as an observed variable in the second model used for prediction?

  2. What is the purpose of tt.squeeze? Why could I not redefine X the same way it is defined in the first model:
    ` X_modeled = pm.Normal(‘X’, mu=Xmu, sd=1, observed=X_new)? Does this some how replace the original mask used in the first model during training by the new mask used for prediction?

  3. Is there any way to see the X_missing estimates generated in the second model when running ppc? This would help me to know that the mask is working correctly.

Again, thanks so much.


I am glad you find it useful :wink: so here is some more thought in reply to your questions:

The way X is defined only makes sense for missing data. In fact, when X is fully observed (no missing data), it is just treated as a deterministic node. In another word, if there is no missing data in X_train, X_modeled = pm.Normal('X', mu=0, sd=1, observed=X_train) is the same as X_modeled = X_train.
But when there are missing data in X_train, the missing data is conceptually the same as replacing it with a number randomly draw from N(0, 1). That is also the reason why you should have an informative prior for X_modeled, as it fills the missing value with a much more meaningful number (e.g., draw from N(Xmu, 1) instead).

tt.squeeze generates a matrix with right shape (shape broadcasting doesn’t work with shape=(n3, 2)). You are right; it should in principle also work by creating a new node with masked observed. It doesnt work with the current set up (at least not easily with the shape issue). Ultimately, what the sample_ppc is doing is indexing to the posterior sample (saved in the trace), and do forward pass to generate random samples.

You can sample the masked matrix:

with model:
    ppc3 = pm.sample_ppc(trace, vars=[y3, X_mask3], samples=200)

and index to the right value in ppc3[‘X_mask3’]



I’m trying to generalize your solution to a very slightly different problem. I’m again having Theano issues and am confused about how to debug them. Previously you showed me how I could fit a model with missing input data and make predictions from that model with missing data. The missing features were modeled with independent normal priors.

I assumed the process would be similar if I wanted to model the missing data with a multivariate Gaussian prior. Once again, I can fit the model with missing data but I’m having trouble with posterior predictive checks in presence of missing data. I can see my issue is related to Theano tensor dimensionality and shape. I’m just not sure how to debug this one.

Here’s what I’ve tried:

1 - set up dummy data

import numpy as np
import pymc3 as pm
import theano.tensor as tt

nobs = 100

# generate from multivariate normal
means = np.zeros(3)
stds = np.ones(3)
corr = np.array([[1,.5,.6],
                 [.5, 1,.4],
                 [.6,.4, 1],])

cov = np.diag(stds).dot(corr).dot(np.diag(stds))
X = np.random.multivariate_normal(means, cov, nobs)

# generate target data 
betas = np.random.randn(3) * .1
y = X.dot(betas) + np.random.randn(nobs) * .1

# mask 25% of data
mask = np.zeros(X.shape)
a = np.random.choice(X.shape[0], int(nobs*.25), replace=True)
b = np.random.choice(X.shape[1], int(nobs*.25), replace=True)
for i in zip(a, b):
    mask[i] = 1
X = np.ma.masked_array(X, mask)

# split train / test set
X_train = X[:nobs / 2]
y_train = y[:nobs / 2]
X_test = X[nobs / 2:]

2 - fit model with multivariate normal prior on input data

with pm.Model() as model:
    X_mod = pm.MvNormal('X', mu=means, cov=cov, observed=X_train)
    b = pm.Normal('beta', mu=0, sd=10, shape=X_train.shape[1])
    est = tt.dot(X_mod, b)
    eps = pm.HalfCauchy('sd', beta=5)
    y_modeled = pm.Normal('y', mu=est, sd=eps, observed=y_train)    
    trace = pm.sample(200, njobs=1)

3 - run ppc with new input data containing missing values
This is the step I’m having trouble with. The Normal prior here works fine but MvNormal returns this error.:

Bad input argument to theano function with name “/pymc3/distributions/distribution.py:230” at index 2 (0-based). Wrong number of dimensions: expected 2, got 1 with shape (3,).

with pm.Model() as model:
    #X_mask = pm.Normal('X_mask', mu=means, sd=stds, shape=(len(X_test), 1, X_test.shape[1]))
    X_mask = pm.MvNormal('X_mask', mu=means, cov=cov, shape=(len(X_test), X_test.shape[1]))
    X_mod = tt.squeeze(X_mask) * X_test.mask + X_test.data * (1 - X_test.mask)

    b = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
    est = tt.dot(X_mod, b)
    eps = pm.HalfCauchy('sd', beta=5)
    y_modeled = pm.Normal('y', mu=est, sd=eps, shape=len(X_test))    
    ppc = pm.sample_ppc(trace, vars=[model['y'],])

Are you able to see why this code doesn’t work and how do you recommend a person debug these types of problems in general?

Thanks again for any insight.


There are not many ways to debug the error in sample_ppc. For example, in your case it is essentially doing y_modeled.random(trace[random_index]), which will return the same error message so it doesnt help.
I would probably just write my own OOS prediction code instead of building a new model, which is much faster and you also have better control:

nppc = 100
ppc = []
for idx in range(nppc):
    X_mask = np.random.multivariate_normal(means, cov, X_test.shape[0])
    X_mod = X_mask * X_test.mask + X_test.data * (1 - X_test.mask)
    point = trace[idx]
    b = point['beta']
    eps = point['sd']
    est = np.matmul(X_mod, b)
    y_pred = np.random.normal(est, eps)

However, the above code is wrong in multivariate case. You should NOT use a i.i.d. multivariate drop-in to replace the missing value in X_test. In this case, the missing information is partially missing – for example, say one row in X is [ma., 0.5, 1.2] and X ~ MvNormal([0., 0., 0.], cov). Instead of generate X_ ~ MvNormal([0., 0., 0.], cov) and do ma. = X_[0], you need to work out the conditional probability of ma. by marginalizing the MvNormal using [.5, 1.2] and cov, and generate from this new conditional probability (also a Gaussian) instead.