OOS predictions with missing input values

Hello,

I’m trying to generalize your solution to a very slightly different problem. I’m again having Theano issues and am confused about how to debug them. Previously you showed me how I could fit a model with missing input data and make predictions from that model with missing data. The missing features were modeled with independent normal priors.

I assumed the process would be similar if I wanted to model the missing data with a multivariate Gaussian prior. Once again, I can fit the model with missing data but I’m having trouble with posterior predictive checks in presence of missing data. I can see my issue is related to Theano tensor dimensionality and shape. I’m just not sure how to debug this one.

Here’s what I’ve tried:

1 - set up dummy data

import numpy as np
import pymc3 as pm
import theano.tensor as tt

nobs = 100

# generate from multivariate normal
means = np.zeros(3)
stds = np.ones(3)
corr = np.array([[1,.5,.6],
                 [.5, 1,.4],
                 [.6,.4, 1],])

cov = np.diag(stds).dot(corr).dot(np.diag(stds))
X = np.random.multivariate_normal(means, cov, nobs)

# generate target data 
betas = np.random.randn(3) * .1
y = X.dot(betas) + np.random.randn(nobs) * .1

# mask 25% of data
mask = np.zeros(X.shape)
a = np.random.choice(X.shape[0], int(nobs*.25), replace=True)
b = np.random.choice(X.shape[1], int(nobs*.25), replace=True)
for i in zip(a, b):
    mask[i] = 1
    
X = np.ma.masked_array(X, mask)

# split train / test set
X_train = X[:nobs / 2]
y_train = y[:nobs / 2]
X_test = X[nobs / 2:]

2 - fit model with multivariate normal prior on input data

with pm.Model() as model:
    X_mod = pm.MvNormal('X', mu=means, cov=cov, observed=X_train)
    b = pm.Normal('beta', mu=0, sd=10, shape=X_train.shape[1])
    est = tt.dot(X_mod, b)
    eps = pm.HalfCauchy('sd', beta=5)
    y_modeled = pm.Normal('y', mu=est, sd=eps, observed=y_train)    
    trace = pm.sample(200, njobs=1)

3 - run ppc with new input data containing missing values
This is the step I’m having trouble with. The Normal prior here works fine but MvNormal returns this error.:

Bad input argument to theano function with name “/pymc3/distributions/distribution.py:230” at index 2 (0-based). Wrong number of dimensions: expected 2, got 1 with shape (3,).

with pm.Model() as model:
    
    #X_mask = pm.Normal('X_mask', mu=means, sd=stds, shape=(len(X_test), 1, X_test.shape[1]))
    X_mask = pm.MvNormal('X_mask', mu=means, cov=cov, shape=(len(X_test), X_test.shape[1]))
    X_mod = tt.squeeze(X_mask) * X_test.mask + X_test.data * (1 - X_test.mask)

    b = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
    est = tt.dot(X_mod, b)
    eps = pm.HalfCauchy('sd', beta=5)
    y_modeled = pm.Normal('y', mu=est, sd=eps, shape=len(X_test))    
    ppc = pm.sample_ppc(trace, vars=[model['y'],])

Are you able to see why this code doesn’t work and how do you recommend a person debug these types of problems in general?

Thanks again for any insight.