I am working on the same problem (fellow Statistical Rethinking classmate!), and I don't think it has anything to do with GPs, actually. It looks like missing-data imputation does not work when the data passed to the observed parameter is a pm.Data variable rather than a plain (masked) numpy array. I'm not sure whether that's a bug or intentional, but replacing the pm.Data references with raw numpy arrays worked for me.
Here's a minimal reproducible example (PyMC 5.1.2):
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng()
real_X = rng.normal(size=100)  # synthetic predictor so the snippet runs standalone (not in the original)
Y = rng.normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan
masked_X = np.ma.masked_where(np.isnan(X), X)

with pm.Model() as m:
    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)

    # This works: passing the masked array directly makes PyMC impute the missing entries
    X = pm.Normal("X", 0, 1, observed=masked_X)

    # This one even fails to sample (the GP example doesn't - but that may be model specific).
    # Swap it in for the line above to reproduce:
    # X = pm.Normal("X", 0, 1, observed=pm.ConstantData("masked_X", masked_X))

    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)

with m:
    trace = pm.sample()

az.summary(trace)
In the GP example, the model samples, but the coefficients it produces are quite different from those of the model where the missing values are imputed (and also from a model fit only on the complete data). This makes me wonder whether the missing values get replaced by something (a default value?) along the way, which sounds like a bug (happy to file an issue on GitHub!).
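One way to check, sketched below assuming the model m and trace from the snippet above are in scope: list the variables PyMC actually created. When automatic imputation kicks in, the missing entries show up as an extra free random variable (named something like X_unobserved in recent PyMC versions) whose posterior appears in the trace; if no such variable exists, the missing values must have been filled in some other way.

# Quick diagnostic (assumes m and trace from the example above):
# if imputation happened, the missing entries appear as an extra free RV.
print(list(m.named_vars))   # look for an "X_unobserved"-style variable
print(m.free_RVs)           # imputed values are sampled as free RVs
az.summary(trace, var_names=["β", "σ"])   # compare β/σ across the variants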