Automatic imputation in multivariate models

I am trying to use automatic imputation with a multivariate model, and it seems that while multivariate imputation used to be supported, it currently is not.

For example, using this notebook: [pymcon] Missing value handling.ipynb · GitHub


loc = np.array([-2., 4.])
scale = np.array([2., 3.])
corr = np.array([[1., .65],
                 [.65, 1.]])
cov = np.outer(scale, scale) * corr
chol = np.linalg.cholesky(cov)

n = 500
sample = np.einsum(
    "ij,...j->...i",
    chol, np.random.randn(n, *loc.shape)
) + loc

n_mask = 50
row_idx = np.random.choice(n, n_mask, replace=False)
col_idx = np.random.binomial(1, .5, size=n_mask)
mask = np.zeros(sample.shape)
mask[row_idx, col_idx] = np.nan
sample_masked = np.ma.array(sample, mask=np.isnan(mask))

masked_element_idx = np.stack(np.where(sample_masked.mask))
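As a quick sanity check on the masking logic above (illustrative only; data here is drawn uncorrelated and seeded for reproducibility, which is an assumption not in the original), the masked array should report exactly `n_mask` missing entries, since each chosen row gets one masked column:

```python
import numpy as np

n, n_mask = 500, 50
rng = np.random.default_rng(0)  # seeded for reproducibility (assumption)
sample = rng.normal(size=(n, 2))

# One masked entry per chosen row, in a random column
row_idx = rng.choice(n, n_mask, replace=False)
col_idx = rng.binomial(1, 0.5, size=n_mask)
mask = np.zeros(sample.shape)
mask[row_idx, col_idx] = np.nan
sample_masked = np.ma.array(sample, mask=np.isnan(mask))

print(sample_masked.mask.sum())  # → 50 entries flagged as missing
```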

with pm.Model() as m4:
    mu = pm.Normal('mu', mu=0, sigma=10, shape=2)
    sd_dist = pm.HalfCauchy.dist(2.5)
    chol_cov, *_ = pm.LKJCholeskyCov(
        'chol_cov', n=2, eta=1, sd_dist=sd_dist,
        compute_corr=True, store_in_trace=False)
    obs = pm.MvNormal(
        'obs', mu=mu, chol=chol_cov, observed=sample_masked)
    trace4 = pm.sample(2000, cores=4, return_inferencedata=True)

I am getting the following error:
NotImplementedError: Automatic inputation is only supported for univariate RandomVariables, but obs is multivariate.

I am currently running pymc 5.0.0.

Furthermore, if imputation for multivariate random variables has been removed from PyMC, I was wondering how I should deal with this.

cc @ricardoV94

We don’t have an API in place yet, but you can do it easily enough with a Potential:

import numpy as np
import pymc as pm
import pytensor.tensor as pt  # PyMC 5 uses PyTensor (formerly Aesara)

# Toy data: a (3, 2) array with four missing entries
data = np.zeros((3, 2))
data[[0, 1, 2, 2], [0, 1, 0, 1]] = np.nan
print(data)

with pm.Model() as m:
    # Create a vector of flat variables for the unobserved components of the MvNormal
    x_unobs = pm.Flat("x_unobs", shape=(np.isnan(data).sum(),))

    # Create the symbolic value of x, combining observed data and unobserved variables
    x = pt.as_tensor(data)
    x = pm.Deterministic("x", pt.set_subtensor(x[np.isnan(data)], x_unobs))

    # Add a Potential with the logp of the variable conditioned on `x`
    pm.Potential("x_logp", pm.logp(rv=pm.MvNormal.dist([0, 0], [[1, 0], [0, 1]]), value=x))
    idata = pm.sample()

Thank you! I was able to implement this in my code and got the solution I was expecting.

I am not very familiar with pm.Flat and pm.Potential and would really like to learn more about them, if you have anything you could direct me to.

Flat is just a way to sample a variable without an intrinsic prior, and Potential is a way to directly add a log-probability term to the model.