 Multivariate normal with missing data imputation operands could not be broadcast together with shapes (29,2) () (29,)

I am working on a bayesian model using pymc3. I have a data set with 29 rows and three columns: “K”, “B”, “M”. Where “B” and “M” are predictor variables. column B has a few missing values.

I have built the following model, which usages pm.MvNormal distribution to support the imputation of missing values.

Model definition

Ki ∼ Normal(μi, sigma)

μi = a + bB * Bi + bM log Mi

sigma ∼ Exponential(1)

a ∼ Normal(0, 0.5)

bB ∼ Normal(0, 0.5)

bM ∼ Normal(0, 0.5)

(Mi, Bi) ∼ MVNormal((muM, muB), sigma_BM)

muM ~ Normal(0.5, 1)

muB ~ Normal(0.5,1)

# data fields
# standardize is a method to make mean 0 and standard deviation 1
# output variable
K_ = standardize(d['kcal.per.g'])

#predictor variable
B_ = standardize(d['neocortex.prop']) # data with missing values
M_ = standardize(d['logmass'])

with pm.Model() as m15_7:
a = pm.Normal("a", 0, 0.5)
bB = pm.Normal("bB", 0, 0.5)
bM = pm.Normal("bM", 0, 0.5)
sigma = pm.Exponential("sigma", 1)

# means of the multivariate Normal distribution
muB = pm.Normal("muB", 0, 0.5)
muM = pm.Normal("muM", 0, 0.5)
# Cholesly decomposition
chol_BM, rho_BM, sigma_BM = pm.LKJCholeskyCov("chol_cov_BM",
eta=2,
n = 2,
sd_dist=pm.Exponential.dist(1),
compute_corr=True)
BM = pm.MvNormal("BM",
tt.stack([muB, muM]),
chol=chol_BM,

# computing mean for output variable distribution
mu = a + bB * BM[:,0] + bM * M_

K = pm.Normal("K", mu, sigma, observed=K_)

trace_15_7 = pm.sample()

After executing this I am getting the following error

ValueError: operands could not be broadcast together with shapes (29,2) () (29,)

1 Like

I couldn’t reproduce the error with mock data; maybe check if your standardize function is returning a single column for each of the features. Can you give us more information, the full traceback, and/or a minimum reproducible example?

K_ = np.random.rand(29)
B_ = np.random.rand(29)
M_ = np.random.rand(29)

B_ has few missing values that are represented as np.nan

Yep, seeing the error with that change.
BM = pm.MvNormal("BM",tt.stack([muB, muM]),chol=chol_BM, observed = X_masked) seems to be the problem line (where the masked values are first used in the model). Interestingly, the sampling process runs almost all the way through and only errors right at the end. I can’t see anything wrong with what you’ve done though. Sampling runs fine using Normal instead of MvNormal
BM = pm.Normal("BM",tt.stack([muB, muM]),observed = X_masked)

Devs, what happens at the end of the sampling process that might cause this?

1 Like

Yes, you are right. could you please look into this? @pymc_devs_bot

Mmmh, could be a bug, so may be worth opening a GitHub issue. What do you think @junpenglao?

missing value in Multivariate RV is in general quite difficult, as we need a conditional distribution to do the imputation, which is currently not implemented.

There is a few more previous discussion you could take a look for work around https://discourse.pymc.io/search?q=mvnormal%20missing

3 Likes

Understand, I have done a simple imputation and it is working. Hope to see imputation in Multivariate RV in the future releases. would be the really exciting . Thanks for the great work .

1 Like

Actually, this seems like a bug - I will try to resolve it
Also, FYI you might find this discussion useful as well How to marginalize on a Multivariate Normal distribution

4 Likes

Meanwhile, it should work by doing:

trace = pm.sample(return_inferencedata=False, compute_convergence_checks=False)

Alright, if you update arviz to master it should work now.
Also, the imputation in pymc3 actually also work for multivariate variables😅

2 Likes