OOS predictions with missing input values


I’m having trouble generating out-of-sample predictions for a simple linear model with missing covariate data.

I understand how to fit a model with missing covariate data using a numpy masked array:

import numpy as np
import pymc3 as pm

# generate data with missing values
X_train = np.random.randn(4,2)
y_train = np.random.randn(4)

mask = [[0, 0], [1, 0], [0, 1], [0, 0],]
X_train = np.ma.masked_array(X_train, mask=mask)

# fit model
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    X_modeled = pm.Normal('X', mu=0, sd=1, observed=X_train)

    mu = alpha + sum([beta[i] * X_modeled[:,i] for i in range(2)])
    sd = pm.HalfCauchy('sd', beta=10)
    y = pm.Normal('y', mu=mu, sd=sd, observed=y_train)
    step = pm.NUTS()
    trace = pm.sample(500, step=step)

I also understand how to generate out of sample predictions using theano shared variables when no data is missing:

import numpy as np
import pymc3 as pm
import theano

# generate train data
X_train = np.random.randn(4,2)
y_train = np.random.randn(4)
X_input = theano.shared(X_train)

# fit model
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    mu = alpha + sum([beta[i] * X_input[:,i] for i in range(2)])
    sd = pm.HalfCauchy('sd', beta=10)
    y = pm.Normal('y', mu=mu, sd=sd, observed=y_train)
    step = pm.NUTS()
    trace = pm.sample(500, step=step)

# generate predictions on test data with fit model
X_test = np.random.randn(4,2)
ppc = pm.sample_ppc(trace, model=model)

I can’t seem to do both at the same time. Is there any way to fit a linear model with missing data and subsequently generate out of sample predictions using that fitted model with new data (which may also contain missing values)?

How to marginalize on a Multivariate Normal distribution
Dealing with multivariate x, errors in both x and y, and posterior predictive for y

Not sure if it is the best way, but my solution would be the following:

1, build a model that allows having missing values in the covariates (i.e., the predictor X). This is similar to your model one. However, to have a more meaningful prediction for missing input, a more informative prior is needed. So instead of:

X_modeled = pm.Normal('X', mu=0, sd=1, observed=X_train)

doing below will allow you to make a better prediction when some inputs are missing:

Xmu = pm.Normal('Xmu', mu=0, sd=10, shape=(1,2))
X_modeled = pm.Normal('X', mu=Xmu, sd=1., observed=X_train)

2, after getting the posterior from the model, you can plug in some new nodes for OOS predictions:

with model:
    # careful of the shape here. As the shape broadcasting might not work automatically
    X_mask3 = pm.Normal('X_mask3', mu=Xmu, sd=1., shape=(n3, 1, 2))
    Xpred = tt.squeeze(X_mask3)*mask2 + x_new*(1-mask2)
    mu3 = alpha + tt.dot(Xpred, beta)
    y3 = pm.Normal('y3', mu=mu3, sd=sd, shape=(n3,))
    ppc3 = pm.sample_ppc(trace, vars=[y3], samples=200)

The new input being x_new, with the mask being mask2. What I am doing here is that the missing element in the predictor x_new (indicated by mask2) was replaced by the posterior prediction generated similarly using a Gaussian with mean Xmu.

I think this is an interesting question so I make a small notebook, hope you find it helpful.


Thanks so much for your help here and for sharing the notebook. I’ve implemented your approach - build two nearly identical models, differing primarily in the way X is defined, rather than a single model with a theano shared variable to replace the input data. It seems to do the trick though I’m not sure I fully understand why it works.


  1. In the initial model X is an observed variable. Why do you not need to define X as an observed variable in the second model used for prediction?

  2. What is the purpose of tt.squeeze? Why could I not redefine X the same way it is defined in the first model:
    ` X_modeled = pm.Normal(‘X’, mu=Xmu, sd=1, observed=X_new)? Does this some how replace the original mask used in the first model during training by the new mask used for prediction?

  3. Is there any way to see the X_missing estimates generated in the second model when running ppc? This would help me to know that the mask is working correctly.

Again, thanks so much.


I am glad you find it useful :wink: so here is some more thought in reply to your questions:

The way X is defined only makes sense for missing data. In fact, when X is fully observed (no missing data), it is just treated as a deterministic node. In another word, if there is no missing data in X_train, X_modeled = pm.Normal('X', mu=0, sd=1, observed=X_train) is the same as X_modeled = X_train.
But when there are missing data in X_train, the missing data is conceptually the same as replacing it with a number randomly draw from N(0, 1). That is also the reason why you should have an informative prior for X_modeled, as it fills the missing value with a much more meaningful number (e.g., draw from N(Xmu, 1) instead).

tt.squeeze generates a matrix with right shape (shape broadcasting doesn’t work with shape=(n3, 2)). You are right; it should in principle also work by creating a new node with masked observed. It doesnt work with the current set up (at least not easily with the shape issue). Ultimately, what the sample_ppc is doing is indexing to the posterior sample (saved in the trace), and do forward pass to generate random samples.

You can sample the masked matrix:

with model:
    ppc3 = pm.sample_ppc(trace, vars=[y3, X_mask3], samples=200)

and index to the right value in ppc3[‘X_mask3’]



I’m trying to generalize your solution to a very slightly different problem. I’m again having Theano issues and am confused about how to debug them. Previously you showed me how I could fit a model with missing input data and make predictions from that model with missing data. The missing features were modeled with independent normal priors.

I assumed the process would be similar if I wanted to model the missing data with a multivariate Gaussian prior. Once again, I can fit the model with missing data but I’m having trouble with posterior predictive checks in presence of missing data. I can see my issue is related to Theano tensor dimensionality and shape. I’m just not sure how to debug this one.

Here’s what I’ve tried:

1 - set up dummy data

import numpy as np
import pymc3 as pm
import theano.tensor as tt

nobs = 100

# generate from multivariate normal
means = np.zeros(3)
stds = np.ones(3)
corr = np.array([[1,.5,.6],
                 [.5, 1,.4],
                 [.6,.4, 1],])

cov = np.diag(stds).dot(corr).dot(np.diag(stds))
X = np.random.multivariate_normal(means, cov, nobs)

# generate target data 
betas = np.random.randn(3) * .1
y = X.dot(betas) + np.random.randn(nobs) * .1

# mask 25% of data
mask = np.zeros(X.shape)
a = np.random.choice(X.shape[0], int(nobs*.25), replace=True)
b = np.random.choice(X.shape[1], int(nobs*.25), replace=True)
for i in zip(a, b):
    mask[i] = 1
X = np.ma.masked_array(X, mask)

# split train / test set
X_train = X[:nobs / 2]
y_train = y[:nobs / 2]
X_test = X[nobs / 2:]

2 - fit model with multivariate normal prior on input data

with pm.Model() as model:
    X_mod = pm.MvNormal('X', mu=means, cov=cov, observed=X_train)
    b = pm.Normal('beta', mu=0, sd=10, shape=X_train.shape[1])
    est = tt.dot(X_mod, b)
    eps = pm.HalfCauchy('sd', beta=5)
    y_modeled = pm.Normal('y', mu=est, sd=eps, observed=y_train)    
    trace = pm.sample(200, njobs=1)

3 - run ppc with new input data containing missing values
This is the step I’m having trouble with. The Normal prior here works fine but MvNormal returns this error.:

Bad input argument to theano function with name “/pymc3/distributions/distribution.py:230” at index 2 (0-based). Wrong number of dimensions: expected 2, got 1 with shape (3,).

with pm.Model() as model:
    #X_mask = pm.Normal('X_mask', mu=means, sd=stds, shape=(len(X_test), 1, X_test.shape[1]))
    X_mask = pm.MvNormal('X_mask', mu=means, cov=cov, shape=(len(X_test), X_test.shape[1]))
    X_mod = tt.squeeze(X_mask) * X_test.mask + X_test.data * (1 - X_test.mask)

    b = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
    est = tt.dot(X_mod, b)
    eps = pm.HalfCauchy('sd', beta=5)
    y_modeled = pm.Normal('y', mu=est, sd=eps, shape=len(X_test))    
    ppc = pm.sample_ppc(trace, vars=[model['y'],])

Are you able to see why this code doesn’t work and how do you recommend a person debug these types of problems in general?

Thanks again for any insight.


There are not many ways to debug the error in sample_ppc. For example, in your case it is essentially doing y_modeled.random(trace[random_index]), which will return the same error message so it doesnt help.
I would probably just write my own OOS prediction code instead of building a new model, which is much faster and you also have better control:

nppc = 100
ppc = []
for idx in range(nppc):
    X_mask = np.random.multivariate_normal(means, cov, X_test.shape[0])
    X_mod = X_mask * X_test.mask + X_test.data * (1 - X_test.mask)
    point = trace[idx]
    b = point['beta']
    eps = point['sd']
    est = np.matmul(X_mod, b)
    y_pred = np.random.normal(est, eps)

However, the above code is wrong in multivariate case. You should NOT use a i.i.d. multivariate drop-in to replace the missing value in X_test. In this case, the missing information is partially missing – for example, say one row in X is [ma., 0.5, 1.2] and X ~ MvNormal([0., 0., 0.], cov). Instead of generate X_ ~ MvNormal([0., 0., 0.], cov) and do ma. = X_[0], you need to work out the conditional probability of ma. by marginalizing the MvNormal using [.5, 1.2] and cov, and generate from this new conditional probability (also a Gaussian) instead.

What does the hierarchical model look like when having missing in observed?

I know this is a really old thread but can someone please help me figure out how I can use PyMC3 to do exactly as quoted?