Marginal Likelihood Calculation with the harmonic Package

Hello PyMC Community!

I’ve uploaded two Jupyter notebooks that demonstrate the use of the learnt harmonic mean estimator for calculating the marginal likelihood of PyMC models:

Harmonic Ground Truth Test

Harmonic Test with nutpie

The first notebook uses a “ground truth” model whose marginal likelihood is known analytically and compares the learnt harmonic mean estimate against it. The sampler is the standard NUTS sampler available in PyMC.

The second notebook skips the ground-truth comparison and simply shows how to use the harmonic package when nutpie is used for sampling.
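If you haven’t used nutpie before, switching the sampler is a one-line change when calling pm.sample. A minimal sketch, assuming an already-defined model context named model:

import pymc as pm

# `model` is a hypothetical, already-defined PyMC model context
with model:
    # Use the nutpie NUTS implementation instead of PyMC's default sampler
    idata = pm.sample(draws=1000, tune=1000, nuts_sampler="nutpie")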

For Mac and Linux users, the setup is straightforward:

# 1. Create the environment (conda-forge is still recommended for PyMC)
conda create -n harmonic_env -c conda-forge python=3.11 pymc nutpie arviz numpy scipy matplotlib jupyterlab

# 2. Activate
conda activate harmonic_env

# 3. Install Harmonic
# On macOS, if you get compiler errors, ensure XCode Command Line Tools are installed: xcode-select --install
pip install harmonic

For Windows users, the setup is a bit more involved: harmonic compiles some code during installation, so a C++ compiler is needed.

Step 1: Install the Microsoft Visual C++ Build Tools (harmonic needs to compile C extensions during installation)

  1. Download Microsoft Visual C++ Build Tools.

  2. Run the installer.

  3. Crucial: Check the box for “Desktop development with C++”.

  4. Click Install and restart your terminal/PC when finished.

Step 2: Create the Environment (Strict Conda-Forge)

Open your terminal (PowerShell or Command Prompt) or the Miniforge Prompt. If possible, do not mix this environment with standard Anaconda channels.

# 1. Create a fresh environment using strictly conda-forge
# Python 3.11 is currently the sweet spot for stability
conda create -n harmonic_env -c conda-forge python=3.11 pymc nutpie arviz numpy scipy matplotlib jupyterlab

# 2. Activate the environment
conda activate harmonic_env

# 3. Install Harmonic via pip (builds using the VS C++ tools installed in Step 1)
pip install harmonic

The paper reference is available here:

Machine learning assisted Bayesian model comparison: learnt harmonic mean estimator

I hope that people give this a try, as it should work with any PyMC model.

Thanks!


Thanks for sharing.

I skimmed through the paper and couldn’t figure out what the algorithm actually looks like. There is some code with normalizing flows in the repo, but no mention of them in the paper.


I am familiar with the harmonic mean estimator of the marginal likelihood and the criticisms from Radford Neal and Christian Robert, and I have been looking for different ways to calculate this quantity. When I discovered this technique, I wanted to share it with the community.

I took the opportunity to use Gemini to outline the paper and provide some additional details.

1. What the Algorithm Actually Does

The Learnt Harmonic Mean Estimator solves the infinite-variance problem of the original harmonic mean estimator by treating it as an importance sampling problem. The algorithm consists of four distinct steps (a code sketch follows the list):

  1. Sample: You generate samples from your posterior using MCMC (e.g., with PyMC).
  2. Split: You divide these samples into a training set and an evaluation (inference) set.
  3. Learn: Using the training set, you train a machine learning model to learn a target distribution, \varphi(\theta) .
    Goal: This target tries to approximate the posterior but is explicitly constrained to have narrower tails than the posterior to minimize variance.
  4. Estimate: You calculate the marginal likelihood using the evaluation set and this newly learned target distribution.
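To make these steps concrete, here is a minimal sketch of the workflow using harmonic’s Chains/Evidence interface. I’m assuming arrays samples (shape nchains × ndraws × ndim) and lnprob (the matching log-posterior values) have already been pulled out of a PyMC trace, and exact class and argument names may differ slightly between harmonic versions, so treat this as an outline rather than copy-paste code; the notebooks have the full details.

import numpy as np
import harmonic as hm

# Assumed inputs (hypothetical names):
#   samples: posterior draws, shape (nchains, ndraws, ndim)
#   lnprob:  log-posterior values, shape (nchains, ndraws)
ndim = samples.shape[-1]

# 1. Sample + 2. Split: wrap the MCMC output and split into training/inference sets
chains = hm.Chains(ndim)
chains.add_chains_3d(samples, lnprob)
chains_train, chains_infer = hm.utils.split_data(chains, training_proportion=0.5)

# 3. Learn: fit a normalized target distribution varphi(theta) on the training set
#    (HyperSphere is the simplest target; the domain is a search range for its radius)
domains = [np.array([1e-1, 1e1])]
model = hm.model.HyperSphere(ndim, domains)
model.fit(chains_train.samples, chains_train.ln_posterior)

# 4. Estimate: compute the log marginal likelihood on the held-out chains
ev = hm.Evidence(chains_infer.nchains, model)
ev.add_chains(chains_infer)
ln_evidence, ln_evidence_std = ev.compute_ln_evidence()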

2. Why Normalizing Flows are in the Repo

You are correct that the paper does not explicitly mention normalizing flows. The paper demonstrates the method using three specific models to learn the target distribution \varphi(\theta):

  • Hyperspheres: For simple, unimodal posteriors.
  • Modified Gaussian Mixture Models (MGMM): For multimodal distributions.
  • Kernel Density Estimation (KDE): For complex shapes like narrow curving degeneracies (e.g., the Rosenbrock function).

However, the paper explicitly states that “alternative more effective target models can be developed” and that the method simply requires any normalized distribution that approximates the posterior.

Normalizing flows are simply a more advanced, modern machine learning technique for approximating probability distributions. They were likely added to the repository as a more powerful alternative to KDE or GMMs for learning complex, high-dimensional target distributions, which fits perfectly into the “Learn” step of the framework described in the paper.

Just for clarification, I have no connection to the original authors, so I’ve had to do a deep dive to get it working with PyMC.

This probably would be of interest to Stan users as well so if anyone is in that community, please repost.

I hope this helps and if there is any other follow up needed, I’m open to discussion.

Thanks!


What do you need the marginal likelihood for? If you’re trying to use it to compare models with something like Bayes factors, you might also want to read Andrew Gelman and Aki Vehtari’s criticisms of Bayes factors (in the Bayesian Data Analysis book’s chapter on model comparison—there’s a free online pdf on the book’s home page) and consider something like leave-one-out cross-validation. Coincidentally, the approximate LOO you get in ArviZ (from Andrew, Aki, and Yuling Yao) uses Pareto smoothing on the importance weights for the same reason you’re suggesting—to cut down on potentially infinite variance.
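If you haven’t tried it, the PSIS-LOO estimate is a one-liner in ArviZ. A minimal sketch, assuming an InferenceData object idata that includes a log_likelihood group:

import arviz as az

# Hypothetical `idata`: an InferenceData with a log_likelihood group
loo_result = az.loo(idata, pointwise=True)
print(loo_result)  # elpd_loo plus Pareto-k diagnostics for the importance weights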

Thanks for the feedback! You are absolutely right that for standard models with independent data, LOO/WAIC are usually preferred over the Marginal Likelihood due to their robustness and focus on predictive accuracy.

However, my primary use case for harmonic is Spatial Models (e.g., Spatial Autoregressive models, Spatial Error models), where LOO and WAIC are often invalid or unavailable.

As far as I know, LOO assumes data points are conditionally independent. In spatial models, data are highly correlated; ‘leaving out’ one point doesn’t remove its information because it is shared with neighbors, leading to overly optimistic CV scores.

I’m working in a fairly obscure area but I wanted the community to be aware of the method.

You’re right—I don’t know that anyone has figured out how to do cross-validation with spatial data. For time series, you can use leave-future-out. Here’s an R vignette that links to the Vehtari, Gelman, Gabry, and Bürkner paper:

Spatio-temporal models aren’t obscure! Geomed has 200+ participants every year who seem to be split between full Bayesian sampling methods (lots of NumPyro these days for the JAX connection) and nested Laplace approximations (largely with INLA).

We made a lot of use of ICAR spatial models crossed with second-order random-walk models for modeling Covid prevalence during the pandemic with the UK government (I was the weak link in that paper—Tom is head of infectious disease modeling for the UK and Mitzi is now working for the parallel U.S. group at the CDC):


Having said that, I’d turn to posterior predictive checks rather than something like Bayes factors to compare spatial models. The reason I don’t like Bayes factors is that they measure prior predictive inference rather than posterior predictive inference. I wrote a blog post about that on Gelman’s blog:

Thanks for the engagement on this!

I suspect the differing perspectives here stem from a “cross-disciplinary translation” issue between General Statistics/Biostatistics and Spatial Econometrics regarding what we classify as a “Spatial Model.”

My primary use case for this package involves Simultaneous Autoregressive (SAR) models (or “Spatial Lag” models). Unlike ICAR, these models posit a direct feedback loop where the outcome y depends on the neighbors’ outcomes Wy (i.e., y = \rho Wy + X\beta + \epsilon).

I appreciate the rigorous push towards LOO—it is absolutely the right default—but I wanted to clarify why this specific class of econometric models requires a different toolset.


I’m pretty inclusive in thinking about spatio-temporal models. I think of a “spatial model” as anything that does spatial smoothing in its prior, with the main culprits being CAR (aka kriging in geology, but basically a GP) and its degenerate (in the sense of not being full rank and thus not properly generative) but more-efficient-to-compute cousin ICAR. Spatial models extend to things like image denoising or Ising (spin-glass) models. And at least the epidemiologists think of them as coming in areal form, like an ICAR model or a Poisson process (where you get observations for areas, like states or counties or zip codes), or in point form like Gaussian processes (where observations come with point locations like my geolocation right now). I think of time series the same way—anything that has a sequence-dependent prior, so it can be random walks to structured HMMs.

I hadn’t heard of simultaneous autoregressive models—am I reading that right that it gives you a recursive definition and you then have to stay constrained to solutions (which I would take to be fixed points of the function excluding the noise term)? If so, how do you do that in practice? My linear algebra’s just not up to this kind of thing.

Thanks for the opportunity to respond, as it gets at the heart of how different fields look at spatial modeling. Your description of these models is very intuitive, and although I’m familiar with them, they are not my primary focus.

I’m an economist and the simultaneous spatial models (SAR, etc.) are frequently used when we believe that the target or dependent variable is spatially autocorrelated. A canonical example would be house prices, which almost by definition are spatially correlated. Another example would be cross-border shopping, where individuals may shop across borders. Think of people crossing a border for a lower sales tax rate or availability of goods.

To answer your question: you are right that the definition looks recursive (y = \rho W y + X\beta + \epsilon), but in Spatial Econometrics, we don’t treat it as a fixed-point iteration. Instead, we solve the simultaneous system directly:

y = (I - \rho W)^{-1}(X\beta + \epsilon)

where W is an N \times N spatial weight matrix that codifies who is a neighbor to whom. Because y appears on both sides, the transformation from \epsilon to y involves the Jacobian term |I - \rho W|. This adds a log-determinant term, \ln |I - \rho W|, to the log-likelihood function. I won’t bore you with the details, but this adds a bit of complexity when sampling.
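To make that concrete, here is a rough sketch of how this likelihood (including the log-determinant term) could be written as a pm.Potential in PyMC. The variable names, priors, and the dense determinant are purely illustrative and not the working code I mention below; for realistic N you would use a sparse or eigenvalue-based log-determinant.

import numpy as np
import pymc as pm
import pytensor.tensor as pt
from pytensor.tensor.nlinalg import det

# Hypothetical inputs: y (N,), X (N, K), and a row-standardized weight matrix W (N, N)
def make_sar_model(y, X, W):
    N, K = X.shape
    I_N = np.eye(N)
    with pm.Model() as model:
        beta = pm.Normal("beta", 0.0, 10.0, shape=K)
        sigma = pm.HalfNormal("sigma", 1.0)
        rho = pm.Uniform("rho", -1.0, 1.0)  # illustrative bounds for the spatial lag

        A = I_N - rho * W                   # (I - rho W)
        resid = pt.dot(A, y) - pt.dot(X, beta)

        # Log-likelihood of y = rho W y + X beta + eps, eps ~ N(0, sigma^2 I):
        #   ln|I - rho W| - N ln(sigma) - ||(I - rho W) y - X beta||^2 / (2 sigma^2)
        logdet = pt.log(pt.abs(det(A)))     # dense determinant: fine for small N only
        pm.Potential(
            "sar_loglike",
            logdet - N * pt.log(sigma) - 0.5 * pt.sum(resid**2) / sigma**2,
        )
    return model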

If I’m thinking about this correctly, it’s effectively a Gaussian Process, but with a very specific, econometrically-motivated covariance structure derived from the spatial weights matrix W.
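Concretely, with \epsilon \sim N(0, \sigma^2 I), the reduced form above implies

y \sim \mathcal{N}\!\left((I - \rho W)^{-1} X\beta,\; \sigma^2 \left[(I - \rho W)^\top (I - \rho W)\right]^{-1}\right),

so the covariance structure is fixed by W and \rho rather than by a kernel function.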

I’m working on the Bayesian estimation of these models in PyMC and have working versions of the code, but I still need to stress-test them. The code is at an alpha/early-beta stage, and I’ll mention it here when it’s ready.

Thanks again for the response.


Nice, and thanks for explaining. I figured there was some linear algebra trick that’d make that fly. The ICAR models are the same, but the determinant term drops out, which reduces compute time to linear in the number of adjacent regions. So it really helps when trying to scale to 300 local authorities over 100 weeks, as we were doing with Covid in the UK.

You’ll find that this kind of spatial modeling shows up in just about every field under a different name. They call it “kriging” in geospatial statistics and “CAR” in epidemiology.

There are a lot of tricks for speeding up inference in Gaussian processes using Fourier transforms, depending on your goal, especially if they’re in 1 or 2 dimensions like a time series. Or by speeding them up using inducing points. For example, if you’re in 1D and only care about estimating means, you can scale to billions of points robustly with pretty standard kernels.

The very last grant on which I was a PI at Columbia was from Sloan, and it was with Andrew Gelman and some pretty well known economists. We were doing a couple of things in that grant, and I left before it really got going, but the main focus was on auction models and information equilibria, for which we needed differential algebraic equation solvers (like the tool builders we are, we were looking for an application of the DAE solver we added to Stan). So I’m familiar with the general issue. Coincidentally, the postdoc we hired for that project, Lu Zhang (now a professor at USC), did her Ph.D. on spatial modeling with Sudipto Banerjee (one of the leading experts in biostats spatial models)—her thesis has a lot of interesting things involving scaling spatial GPs, and I think the papers came out in Biometrika. Mitzi Morris, Andrew, and I also built spatial models with the UK government based on mobility data from Google (that’s the reference I linked), but we did it with simpler models than GPs. Andrew’s Ph.D. was actually on spatial models for image denoising. These models are ubiquitous.
