Regularizing the posterior of a parameter with the KL divergence from a different posterior

I have been struggling to implement a simple idea for penalizing the likelihood of a model when the posterior of a parameter estimated on a small sample differs from the posterior of that same parameter estimated on a very large sample. I want to do this in order to regularize the posterior estimation on the small sample (basically, let the model fit the data if it really is a good fit; otherwise, stick to the posterior estimated on the very large sample).

First, I fit a simple model on a sample:

import pymc as pm
import numpy as np

# Simulated large sample, 10000 data points
sample_1 = np.random.randn(10000)

# --- STEP 1: Fit ADVI on the first sample ---
with pm.Model() as model_1:
    param = pm.Normal("param", mu=0, sigma=1, shape=5)
    obs = pm.Normal("obs", mu=param.sum(), sigma=1, observed=sample_1)

    # Fit using ADVI
    approx_1 = pm.fit(method="advi", n=10000)

# Extract the fitted mean-field posterior: a diagonal Gaussian whose mean
# and standard deviation we can evaluate as plain arrays (param is
# unconstrained, so no transform is needed)
post_mean = approx_1.mean.eval()
post_std = approx_1.std.eval()

Next, I want to fit a model with the same parameters on a second sample, but I am using the KL divergence to make sure that the posterior of that parameter in the second sample does not differ too much from the posterior I got from the first sample. My code looks like this:

# Simulated small sample, 100 data points
sample_2 = np.random.randn(100)

# --- STEP 2: Fit ADVI on the second dataset with KL penalty ---
with pm.Model() as model_2:
    param = pm.Normal("param", mu=0, sigma=1, shape=5)
    obs = pm.Normal("obs", mu=param.sum(), sigma=1, observed=sample_2)

    # Log-density of param under the first (mean-field) posterior
    log_p_first = pm.logp(pm.Normal.dist(post_mean, post_std), param).sum()
    # Problematic line of code follows: this is the PRIOR logp of param,
    # not its posterior logp on the second sample
    log_p_second = pm.logp(pm.Normal.dist(0, 1), param).sum()
    kl_div = log_p_first - log_p_second

    # Add potential to penalize divergence
    pm.Potential("kl_penalty", -kl_div)

# Fit the second model
with model_2:
    approx_2 = pm.fit(method="advi", n=10000)

Now, this works well (as do some other approaches and ideas I have tried), but it does not completely capture my idea. As you can see in the “problematic line of code”, instead of calculating the posterior logp of param on the second sample and comparing it with the posterior logp derived from the first sample, I am calculating the prior logp of param and comparing that with the posterior logp from the first sample. Is there a way to use the posterior logp here as well?
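For reference, this is what I mean by “the KL divergence between the two posteriors”: if both ADVI fits are taken at face value as diagonal Gaussians, that KL has a closed form. A small helper illustrating the quantity I am after (my own sketch, not PyMC API):

import numpy as np

def gaussian_kl(mu_q, sd_q, mu_p, sd_p):
    # KL(q || p) between diagonal Gaussians, summed over dimensions
    return np.sum(
        np.log(sd_p / sd_q)
        + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2)
        - 0.5
    )

# e.g., comparing the two mean-field fits after the fact:
# gaussian_kl(approx_2.mean.eval(), approx_2.std.eval(), post_mean, post_std)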

Another thing I am interested in is your opinion of the approach itself (penalizing the likelihood with the KL divergence between the two posteriors in order to get similar posteriors from the second model). Do you think there is a better way to do this?

Maybe one option would be to actually use pm.sample in the first step and then use the posterior samples obtained from the sampling as the prior in the next fit or sample or whatever you want to do? As far as I know, using multidimensional posteriors as priors is not completely straightforward, but there is a nice discussion here:

I am actually using prior_from_idata to reuse the posterior from the first model as a prior for the second model, I just didn’t want to complicate the example code here. My goal is to regularize the posterior of the second model even further, because just using the posterior from the first model as a prior for the second model still resulted in a posterior that diverged too much from the posterior of the first model.
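For completeness, the reuse looks roughly like this (a sketch: prior_from_idata lives in pymc_experimental, and I am assuming idata_1 holds posterior samples drawn from model_1 with pm.sample):

from pymc_experimental.utils.prior import prior_from_idata

with pm.Model() as model_2_reuse:
    # Build a prior from the first model's posterior samples
    # (approximates them with a multivariate normal under the hood)
    priors = prior_from_idata(idata_1, var_names=["param"])
    param = priors["param"]
    pm.Normal("obs", mu=param.sum(), sigma=1, observed=sample_2)
    idata_2 = pm.sample()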

What you want is called a hierarchical model; I’d call it the bread-and-butter of Bayesian modeling. It doesn’t need KL divergence. Here’s a simple PyMC example:
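(The linked example isn’t reproduced in the thread; below is a minimal partial-pooling sketch in the same spirit, with simulated group data standing in for the real dataset.)

import numpy as np
import pymc as pm

# Simulated data: 8 groups, 10 observations each, with group means
# drawn from a common population
rng = np.random.default_rng(0)
n_groups = 8
group_idx = np.repeat(np.arange(n_groups), 10)
y = rng.normal(rng.normal(0.0, 1.0, n_groups)[group_idx], 1.0)

with pm.Model() as hierarchical_model:
    # Population-level parameters shared by all groups
    mu = pm.Normal("mu", mu=0, sigma=1)
    sigma_group = pm.HalfNormal("sigma_group", sigma=1)
    # Group-level effects, partially pooled toward the population mean;
    # groups with little data get shrunk toward mu automatically
    theta = pm.Normal("theta", mu=mu, sigma=sigma_group, shape=n_groups)
    pm.Normal("y", mu=theta[group_idx], sigma=1, observed=y)
    idata = pm.sample()

The shrinkage toward mu does automatically what the KL penalty was trying to do by hand: the large sample informs the population-level parameters, and the small sample’s estimate is pulled toward them exactly as much as the data warrant.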

and here’s my longer Stan example along the same lines and using the same classic Efron and Morris data:

The usual way to control the posterior is through the prior, unless you want to change the likelihood.
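For instance, in the setup from the question, that could be as simple as centering a tighter prior on the first fit (a sketch reusing post_mean and post_std from Step 1; scale the sigma down to regularize harder):

with pm.Model() as model_2_tight:
    # Prior centered on the first posterior; a scale factor below 1
    # pulls the second fit toward the first posterior more strongly
    param = pm.Normal("param", mu=post_mean, sigma=post_std, shape=5)
    pm.Normal("obs", mu=param.sum(), sigma=1, observed=sample_2)
    approx_2_tight = pm.fit(method="advi", n=10000)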
