Truncated multivariate normal likelihood

Hello,

I am trying to fit the mean and covariance of a 2D normal distribution to some data, with the complication that my data are truncated: I only have observations inside a window, even though the underlying distribution really is normal.
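For reference, the density I’m after is the usual normal density rescaled by the probability mass inside the window. Writing $a$ and $b$ for the lower and upper corners of the window:

$$
p(x \mid \mu, \Sigma) = \frac{\mathcal{N}(x \mid \mu, \Sigma)}{\int_{a}^{b} \mathcal{N}(u \mid \mu, \Sigma)\, du} \quad \text{for } a \le x \le b,
$$

so the normalization constant in the denominator depends on $\mu$ and $\Sigma$ and has to be part of the likelihood.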

Some astronomers ran into this same problem (among others) while fitting GMMs and developed an expectation-maximization algorithm that I’ve used successfully (Filling the gaps: Gaussian mixture models from noisy, truncated or incomplete samples), but I’d like to try Bayesian inference as well.

As far as I understand, PyMC only has a univariate truncated normal (pymc.TruncatedNormal), so I’m trying to define my own multivariate truncated normal with constant truncation bounds. The part I’m really struggling with is the normalization constant. Here’s an example; the scipy.stats lines inside the logp function don’t actually work, they’re just there to show what I’m imagining.

import pymc as pm
from scipy import stats
import numpy as np
rng = np.random.default_rng()

lower_bounds = np.zeros(2)
upper_bounds = np.array([128, 32])

# I have some guesses about the mean and covariance
prior_mean = np.mean([lower_bounds, upper_bounds], axis=0)
prior_covar = np.diag([2000., 400.])

# Some mock data
data = rng.multivariate_normal(
    prior_mean + np.array([10, 15]),
    prior_covar + np.array([[10, 11], [11, 12]]),
    size=10000
)
# Simulating truncation
mask = np.all((data >= lower_bounds) & (data <= upper_bounds), axis=1)
data = data[mask]

tune = 500
draws = 500

# Modeling
with pm.Model() as model:
    # Priors
    mu = pm.Normal(
        'mu', 
        mu=prior_mean, 
        sigma=50, 
        shape=2
    )
    chol, corr, std_devs = pm.LKJCholeskyCov(
        'chol_cov',
        eta=5.,  # I expect low correlation
        n=2,
        # sd_dist is a prior on the standard deviations, so use the square
        # roots of the prior variances, not the variances themselves
        sd_dist=pm.InverseGamma.dist(
            mu=np.sqrt(np.diag(prior_covar)),
            sigma=np.array([200, 50]),
            shape=2
        )
    )
    cov = pm.Deterministic('cov', chol.dot(chol.T))

    # This is the part that doesn't work: trying to define the
    # log-probability of the truncated normal at `value`
    def truncated_mvnormal_logp(value, mu, cov):
        # Log density of the un-truncated normal
        untruncated_logp = pm.logp(
            pm.MvNormal.dist(mu, cov),
            value
        )

        # Find the normalization constant: the probability mass inside the
        # window. This is what I can't express: `mu` and `cov` are symbolic
        # PyTensor variables here, so calling .eval() on them inside the
        # logp doesn't do what I need
        mvn = stats.multivariate_normal(mean=mu.eval(), cov=cov.eval())
        # In 2D the window probability needs inclusion-exclusion,
        # not just a difference of CDFs
        norm_const = (
            mvn.cdf(upper_bounds)
            - mvn.cdf(np.array([lower_bounds[0], upper_bounds[1]]))
            - mvn.cdf(np.array([upper_bounds[0], lower_bounds[1]]))
            + mvn.cdf(lower_bounds)
        )

        return untruncated_logp - np.log(norm_const)

    pm.CustomDist('x', mu, cov, logp=truncated_mvnormal_logp, observed=data)
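
    # With a working logp, sampling would then just be, e.g.
    # idata = pm.sample(draws=draws, tune=tune)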

I don’t want to just kill samples outside the window with pymc.Potential, since points outside the window really are probable under the true distribution; it’s just that my data can’t include them because I only see what falls inside the window. But a truncated likelihood feels like it’s doing the same sort of thing, saying those samples aren’t probable… Anyway, I’m all mixed up.

Any guidance on how to do this kind of thing?

You can’t use SciPy code like that inside the logp; everything in the logp has to be expressed as PyTensor operations. Someone tried to implement the multivariate normal CDF in PyTensor a while ago, but the attempt was abandoned: Added MVN_cdf.py by JessSpearing · Pull Request #60 · pymc-devs/pymc-extras · GitHub

Importantly, you’ll also want gradients so that you can use NUTS.
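
For what it’s worth, here’s a rough sketch of what calling SciPy from the graph would take: wrapping the window probability in a custom PyTensor Op (the class name MvnWindowProb is made up for illustration, and this is untested). Because perform() runs plain NumPy/SciPy, the Op has no grad(), so NUTS can’t use it and you’d be stuck with gradient-free steppers like Slice or Metropolis:

import numpy as np
import pytensor.tensor as pt
from pytensor.graph import Apply, Op
from scipy import stats

class MvnWindowProb(Op):
    """P(lower <= X <= upper) for X ~ N(mu, cov), computed with SciPy.

    perform() is plain NumPy/SciPy, so this Op defines no grad():
    NUTS can't differentiate through it.
    """

    def __init__(self, lower, upper):
        self.lower = np.asarray(lower, dtype=float)
        self.upper = np.asarray(upper, dtype=float)

    def make_node(self, mu, cov):
        mu = pt.as_tensor_variable(mu)
        cov = pt.as_tensor_variable(cov)
        return Apply(self, [mu, cov], [pt.dscalar()])

    def perform(self, node, inputs, output_storage):
        mu, cov = inputs
        mvn = stats.multivariate_normal(mean=mu, cov=cov)
        lo, up = self.lower, self.upper
        # 2D window probability via inclusion-exclusion on the CDF
        prob = (
            mvn.cdf(up)
            - mvn.cdf(np.array([lo[0], up[1]]))
            - mvn.cdf(np.array([up[0], lo[1]]))
            + mvn.cdf(lo)
        )
        output_storage[0][0] = np.asarray(prob, dtype=np.float64)

Inside the logp you’d then replace the SciPy lines with something like window_prob = MvnWindowProb(lower_bounds, upper_bounds)(mu, cov) and return untruncated_logp - pt.log(window_prob). But as said, without a grad() implementation you lose NUTS, which defeats much of the point.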

Okay, thanks for clarifying!