Issue Imputing data for Gaussian Process Model

Hi, I am having an issue imputing missing data in the following model:

class LinearMean(pm.gp.mean.Mean):
    def __init__(self, intercept, slopes, M, G):
        self.intercept = intercept
        self.slopes = slopes
        self.M = M
        self.G = G

    def __call__(self, X):
        # linear mean function in body mass (M) and group size (G)
        return self.intercept + self.slopes[0]*self.M + self.slopes[1]*self.G

with pm.Model() as fMBG_OU:
    Mobs     = pm.MutableData('Mobs',scale(np.log(primBnan['body'])))
    Bobs     = pm.MutableData('Bobs',scale(np.log(primBnan['brain'])).reshape(-1, 1))
    Gobs     = pm.MutableData('Gobs',scale(np.log(primBnan['group_size'])))
    Dmat = pm.MutableData('Dmat',
                          (Dmat0[primBnan.index,primBnan.index]/np.max(Dmat0)).values)
    
    rho = pm.Exponential('rho',.25)
    etasq = pm.Exponential('etasq',0.25)
#    SIGMA = pm.HalfNormal('SIGMA',0,1)
    bM = pm.Normal('bM',0,0.5)
    bG = pm.Normal('bG',0,0.5)
    a  = pm.Normal('a' ,0,1)
    
    mu_G = pm.Normal('mu_G',0,1)
    sigma_G = pm.Exponential('sigma_G',1)
    G = pm.Normal('G',mu_G,sigma_G,observed=Gobs)
    
    mu_M = pm.Normal('mu_M',0,1)
    sigma_M = pm.Exponential('sigma_M',1)
    M = pm.Normal('M',mu_M,sigma_M,observed=Mobs) 
    
    cov = etasq * pm.gp.cov.ExpQuad(input_dim=1,ls=rho)
    mu = LinearMean(intercept = a,
                   slopes    = [bM,bG],
                   M=M,
                   G=G)
    
    gp = pm.gp.Marginal(mean_func=mu,cov_func=cov)

    B = gp.marginal_likelihood('B',X=Dmat,y=Bobs,sigma=0)

    fMBG_OU_trace = pm.sample()

Mobs and Gobs contain NaN values, which I wish to impute.

If I model Mobs/Gobs individually, like so:

with pm.Model() as Gimp:
    mu = pm.Normal('mu',0,1)
    sigma = pm.Exponential('sigma',1)
    G = pm.Normal('G',mu,sigma,observed=scale(np.log(primBnan['group_size'])))
    
    gimp_trace = pm.sample()

Imputation occurs and I get values for G_missing.
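For example, this is roughly how I look at them (just a sketch; it assumes ArviZ is imported as az, as in the example further down):

# posterior summaries for the imputed group-size values created by PyMC's
# automatic imputation (the variable is named 'G_missing')
az.summary(gimp_trace, var_names=['G_missing'])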

Also, if I fill all the NaN values with 1, the complete model samples correctly.
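(By that I mean something like the line below, assuming primBnan is a pandas DataFrame; it is only meant to illustrate the workaround, not a real imputation.)

# crude workaround: replace every NaN with 1 before building the data containers
primBnan_filled = primBnan.fillna(1.0)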

Where am I going wrong?

I think the GP may simply not perform automatic data imputation.

I am working on the same problem ( :wave: fellow Statistical Rethinking classmate!), and I don’t think it has anything to do with GPs, actually. It looks like missing-data imputation does not work when the data passed to the observed parameter is a pm.Data variable rather than a plain (masked) NumPy array. I am unsure whether that’s a bug or intentional, but replacing the pm.Data references with raw NumPy arrays worked for me; a sketch of that change applied to your model follows.
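Roughly, that means building masked arrays outside the model and passing them straight to observed instead of wrapping them in pm.MutableData. This is only a sketch, assuming primBnan and scale are as in your model; np.ma.masked_invalid masks the NaNs:

# masked arrays built from the same transformed columns as in your model
M_masked = np.ma.masked_invalid(scale(np.log(primBnan['body'])))
G_masked = np.ma.masked_invalid(scale(np.log(primBnan['group_size'])))

with pm.Model() as fMBG_OU:
    ...  # same priors as before
    # pass the masked arrays directly, not MutableData containers holding NaNs
    M = pm.Normal('M', mu_M, sigma_M, observed=M_masked)
    G = pm.Normal('G', mu_G, sigma_G, observed=G_masked)
    ...  # GP part unchanged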

Here’s a minimal reproducible example (PyMC 5.1.2):

import numpy as np
import pymc as pm
import arviz as az

# synthetic data: real_X is just an example predictor; the first 10 entries of
# the copy are set to NaN and then masked
real_X = np.random.default_rng().normal(size=100)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan
masked_X = np.ma.masked_where(np.isnan(X), X)

with pm.Model() as m:

    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)

    # This works: passing the masked array directly triggers imputation
    X = pm.Normal("X", 0, 1, observed=masked_X)

    # This even fails to sample (the GP example doesn't - but that may be
    # model specific). Swap it in for the line above to reproduce the failure:
    # X = pm.Normal("X", 0, 1, observed=pm.ConstantData("masked_X", masked_X))

    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)

with m:
    trace = pm.sample()
az.summary(trace)

In the GP example, the model samples, but the coefficients produced are quite different from the ones in the model where the missing values are actually imputed (and also from a model fit only on the complete cases). This makes me wonder whether the missing values are being replaced with something (a default value?) along the way, which sounds like a bug (happy to file an issue on GitHub!).
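One quick check (just a sketch) is to list the model’s free random variables: when imputation actually happens, PyMC creates a *_missing variable, like the G_missing you saw above, so its absence would mean the NaNs were handled some other way:

# if imputation happened in the GP model, 'M_missing' and 'G_missing' should
# appear here; if they don't, the NaNs were not imputed
print([rv.name for rv in fMBG_OU.free_RVs])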