Issue Imputing data for Gaussian Process Model

Hi, I am having an issue imputing missing data in the following model:

class LinearMean(pm.gp.mean.Mean):
    def __init__(self, intercept, slopes, M, G):
        self.intercept = intercept
        self.slopes = slopes
        self.M = M
        self.G = G

    def __call__(self, X):
        # linear mean function in body mass (M) and group size (G)
        return self.intercept + self.slopes[0]*self.M + self.slopes[1]*self.G

with pm.Model() as fMBG_OU:
    Mobs     = pm.MutableData('Mobs',scale(np.log(primBnan['body'])))
    Bobs     = pm.MutableData('Bobs',scale(np.log(primBnan['brain'])).reshape(-1, 1))
    Gobs     = pm.MutableData('Gobs',scale(np.log(primBnan['group_size'])))
    Dmat = pm.MutableData('Dmat',
                          (Dmat0[primBnan.index,primBnan.index]/np.max(Dmat0)).values)
    
    rho = pm.Exponential('rho',.25)
    etasq = pm.Exponential('etasq',0.25)
#    SIGMA = pm.HalfNormal('SIGMA',0,1)
    bM = pm.Normal('bM',0,0.5)
    bG = pm.Normal('bG',0,0.5)
    a  = pm.Normal('a' ,0,1)
    
    mu_G = pm.Normal('mu_G',0,1)
    sigma_G = pm.Exponential('sigma_G',1)
    G = pm.Normal('G',mu_G,sigma_G,observed=Gobs)
    
    mu_M = pm.Normal('mu_M',0,1)
    sigma_M = pm.Exponential('sigma_M',1)
    M = pm.Normal('M',mu_M,sigma_M,observed=Mobs) 
    
    cov = etasq * pm.gp.cov.ExpQuad(input_dim=1,ls=rho)
    mu = LinearMean(intercept = a,
                   slopes    = [bM,bG],
                   M=M,
                   G=G)
    
    gp = pm.gp.Marginal(mean_func=mu,cov_func=cov)

    B = gp.marginal_likelihood('B',X=Dmat,y=Bobs,sigma=0)

    fMBG_OU_trace = pm.sample()

Mobs and Gobs contain NaN values, which I wish to impute.

If I model Mobs/Gobs individually, like so:

with pm.Model() as Gimp:
    mu = pm.Normal('mu',0,1)
    sigma = pm.Exponential('sigma',1)
    G = pm.Normal('G',mu,sigma,observed=scale(np.log(primBnan['group_size'])))
    
    gimp_trace = pm.sample()

Imputation occurs and I get values for G_missing.
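For example, this is roughly how I look at them (just a sketch; it assumes ArviZ is imported as az, as in the example further down):

# posterior summaries for the imputed group-size values created by PyMC's
# automatic imputation (the variable is named 'G_missing')
az.summary(gimp_trace, var_names=['G_missing'])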

Also, if I fill all the NaN values with 1, the complete model samples correctly.
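(By that I mean something like the line below, assuming primBnan is a pandas DataFrame; it is only meant to illustrate the workaround, not a real imputation.)

# crude workaround: replace every NaN with 1 before building the data containers
primBnan_filled = primBnan.fillna(1.0)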

Where am I going wrong?

I think the GP may simply not perform automatic data imputation.

I am working on the same problem ( :wave: fellow Statistical Rethinking classmate!), and I don’t think it has anything to do with GPs, actually. It looks like missing-data imputation does not work when the data passed to the observed parameter is a pm.Data variable rather than a plain (masked) NumPy array. I am unsure whether that’s a bug or intentional, but replacing the pm.Data references with raw NumPy arrays worked for me; a sketch of that change applied to your model follows.
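Roughly, that means building masked arrays outside the model and passing them straight to observed instead of wrapping them in pm.MutableData. This is only a sketch, assuming primBnan and scale are as in your model; np.ma.masked_invalid masks the NaNs:

# masked arrays built from the same transformed columns as in your model
M_masked = np.ma.masked_invalid(scale(np.log(primBnan['body'])))
G_masked = np.ma.masked_invalid(scale(np.log(primBnan['group_size'])))

with pm.Model() as fMBG_OU:
    ...  # same priors as before
    # pass the masked arrays directly, not MutableData containers holding NaNs
    M = pm.Normal('M', mu_M, sigma_M, observed=M_masked)
    G = pm.Normal('G', mu_G, sigma_G, observed=G_masked)
    ...  # GP part unchanged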

Here’s a minimal reproducible example (PyMC 5.1.2):

import numpy as np
import pymc as pm
import arviz as az

# synthetic data: real_X is just an example predictor; the first 10 entries of
# the copy are set to NaN and then masked
real_X = np.random.default_rng().normal(size=100)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan
masked_X = np.ma.masked_where(np.isnan(X), X)

with pm.Model() as m:

    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)

    # This works: passing the masked array directly triggers imputation
    X = pm.Normal("X", 0, 1, observed=masked_X)

    # This even fails to sample (the GP example doesn't - but that may be
    # model specific). Swap it in for the line above to reproduce the failure:
    # X = pm.Normal("X", 0, 1, observed=pm.ConstantData("masked_X", masked_X))

    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)

with m:
    trace = pm.sample()
az.summary(trace)

In the GP example, the model samples, but the coefficients produced are quite different from the ones in the model where the missing values are actually imputed (and also from a model fit only on the complete cases). This makes me wonder whether the missing values are being replaced with something (a default value?) along the way, which sounds like a bug (happy to file an issue on GitHub!).
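One quick check (just a sketch) is to list the model’s free random variables: when imputation actually happens, PyMC creates a *_missing variable, like the G_missing you saw above, so its absence would mean the NaNs were handled some other way:

# if imputation happened in the GP model, 'M_missing' and 'G_missing' should
# appear here; if they don't, the NaNs were not imputed
print([rv.name for rv in fMBG_OU.free_RVs])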