Multivariate normal with missing data imputation operands could not be broadcast together with shapes (29,2) () (29,)

saurpan · August 28, 2020, 5:18am

I am working on a bayesian model using pymc3. I have a data set with 29 rows and three columns: “K”, “B”, “M”. Where “B” and “M” are predictor variables. column B has a few missing values.

I have built the following model, which usages pm.MvNormal distribution to support the imputation of missing values.

Model definition

Ki ∼ Normal(μi, sigma)

μi = a + bB * Bi + bM log Mi

sigma ∼ Exponential(1)

a ∼ Normal(0, 0.5)

bB ∼ Normal(0, 0.5)

bM ∼ Normal(0, 0.5)

(Mi, Bi) ∼ MVNormal((muM, muB), sigma_BM)

muM ~ Normal(0.5, 1)

muB ~ Normal(0.5,1)

# data fields
# standardize is a method to make mean 0 and standard deviation 1 
# output variable
K_ = standardize(d['kcal.per.g'])

#predictor variable
B_ = standardize(d['neocortex.prop']) # data with missing values
M_ = standardize(d['logmass'])
X_masked = np.ma.masked_invalid(np.stack([B_, M_]).T)

with pm.Model() as m15_7:
  a = pm.Normal("a", 0, 0.5)
  bB = pm.Normal("bB", 0, 0.5)
  bM = pm.Normal("bM", 0, 0.5)
  sigma = pm.Exponential("sigma", 1)

# means of the multivariate Normal distribution
  muB = pm.Normal("muB", 0, 0.5)
  muM = pm.Normal("muM", 0, 0.5)
# Cholesly decomposition
  chol_BM, rho_BM, sigma_BM = pm.LKJCholeskyCov("chol_cov_BM",
                                          eta=2, 
                                          n = 2, 
                                          sd_dist=pm.Exponential.dist(1), 
                                          compute_corr=True)  
  BM = pm.MvNormal("BM",
                   tt.stack([muB, muM]),
                   chol=chol_BM, 
                   observed = X_masked)

  # computing mean for output variable distribution
  mu = a + bB * BM[:,0] + bM * M_

  K = pm.Normal("K", mu, sigma, observed=K_)

  trace_15_7 = pm.sample()

After executing this I am getting the following error

ValueError: operands could not be broadcast together with shapes (29,2) () (29,)

Please help

Elenchus · August 28, 2020, 6:17am

I couldn’t reproduce the error with mock data; maybe check if your standardize function is returning a single column for each of the features. Can you give us more information, the full traceback, and/or a minimum reproducible example?

K_ = np.random.rand(29)
B_ = np.random.rand(29)
M_ = np.random.rand(29)

saurpan · August 28, 2020, 6:37am

B_ has few missing values that are represented as np.nan
Please retry by adding few missing values in variable B_

Elenchus · August 28, 2020, 7:06am

Yep, seeing the error with that change.
BM = pm.MvNormal("BM",tt.stack([muB, muM]),chol=chol_BM, observed = X_masked) seems to be the problem line (where the masked values are first used in the model). Interestingly, the sampling process runs almost all the way through and only errors right at the end. I can’t see anything wrong with what you’ve done though. Sampling runs fine using Normal instead of MvNormal
BM = pm.Normal("BM",tt.stack([muB, muM]),observed = X_masked)

Devs, what happens at the end of the sampling process that might cause this?

saurpan · August 28, 2020, 10:41am

Yes, you are right. could you please look into this? @pymc_devs_bot

AlexAndorra · August 28, 2020, 1:20pm

Mmmh, could be a bug, so may be worth opening a GitHub issue. What do you think @junpenglao?

junpenglao · August 28, 2020, 3:37pm

missing value in Multivariate RV is in general quite difficult, as we need a conditional distribution to do the imputation, which is currently not implemented.

There is a few more previous discussion you could take a look for work around https://discourse.pymc.io/search?q=mvnormal%20missing

saurpan · August 31, 2020, 2:11am

Understand, I have done a simple imputation and it is working. Hope to see imputation in Multivariate RV in the future releases. would be the really exciting . Thanks for the great work .

junpenglao · September 4, 2020, 11:43am

Actually, this seems like a bug - I will try to resolve it
Also, FYI you might find this discussion useful as well How to marginalize on a Multivariate Normal distribution

junpenglao · September 4, 2020, 11:55am

Meanwhile, it should work by doing:

trace = pm.sample(return_inferencedata=False, compute_convergence_checks=False)

junpenglao · September 5, 2020, 7:44pm

Alright, if you update arviz to master it should work now.
Also, the imputation in pymc3 actually also work for multivariate variables😅

saurpan · September 7, 2020, 4:34am

I have upgraded arviz using

pip install git+https://github.com/arviz-devs/arviz

but it is still giving the same error.
Only the previous suggested solution seems to work

junpenglao · September 7, 2020, 6:06am

Try:

pip install -U git+https://github.com/arviz-devs/arviz.git
pip install -U git+https://github.com/pymc-devs/pymc3.git

Topic		Replies	Views
Multivariate Normal Imputation Error v5 aesara	1	529	December 19, 2022
Handling missing values in predictor when outcome is a Multivariate Normal distribution v5	7	108	October 25, 2024
Generating a random matrix with MvNormal Questions	4	733	December 4, 2019
Model multivariate normal with separate means, dimension mismatch error Questions	2	3075	August 7, 2017
Automatic Imputation of Multivariate models Questions	4	723	December 19, 2022

Multivariate normal with missing data imputation operands could not be broadcast together with shapes (29,2) () (29,)

Related topics