I am working on the same problem (fellow Statistical Rethinking classmate!), and I don't think it has anything to do with GPs, actually. It looks like missing-data imputation does not work when the data passed to the observed parameter is a pm.Data variable rather than a plain (masked) numpy array. I'm not sure whether that's a bug or intentional, but replacing the pm.Data references with raw numpy arrays worked for me.
Here's a minimal reproducible example (PyMC 5.1.2):
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng()
real_X = rng.normal(size=100)  # synthetic predictor so the snippet runs standalone (not in the original)
Y = rng.normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan
masked_X = np.ma.masked_where(np.isnan(X), X)

with pm.Model() as m:
    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)

    # This works: passing the masked array directly makes PyMC impute the missing entries
    X = pm.Normal("X", 0, 1, observed=masked_X)

    # This one even fails to sample (the GP example doesn't - but that may be model specific).
    # Swap it in for the line above to reproduce:
    # X = pm.Normal("X", 0, 1, observed=pm.ConstantData("masked_X", masked_X))

    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)

with m:
    trace = pm.sample()

az.summary(trace)
In the GP example, the model samples, but the coefficients it produces are quite different from those of the model where the missing values are imputed (and also from a model fit only on the complete data). This makes me wonder whether the missing values get replaced by something (a default value?) along the way, which sounds like a bug (happy to file an issue on GitHub!).
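One way to check, sketched below assuming the model m and trace from the snippet above are in scope: list the variables PyMC actually created. When automatic imputation kicks in, the missing entries show up as an extra free random variable (named something like X_unobserved in recent PyMC versions) whose posterior appears in the trace; if no such variable exists, the missing values must have been filled in some other way.

# Quick diagnostic (assumes m and trace from the example above):
# if imputation happened, the missing entries appear as an extra free RV.
print(list(m.named_vars))   # look for an "X_unobserved"-style variable
print(m.free_RVs)           # imputed values are sampled as free RVs
az.summary(trace, var_names=["β", "σ"])   # compare β/σ across the variants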