I have been struggling to implement a simple idea: penalizing the likelihood of a model when the posterior of a parameter estimated on a small sample differs from the posterior of that same parameter estimated on a very large sample. The goal is to regularize the posterior estimation on the small sample (basically: let the model fit the data if it really is a good fit; otherwise, stick to the posterior estimated on the very large sample).
First, I fit a simple model on a sample:
```python
import pymc as pm
import numpy as np

# Simulated large sample, 10000 data points
sample_1 = np.random.randn(10000)

# --- STEP 1: Fit ADVI on the first sample ---
with pm.Model() as model_1:
    param = pm.Normal("param", mu=0, sigma=1, shape=5)
    obs = pm.Normal("obs", mu=param.sum(), sigma=1, observed=sample_1)
    # Fit using ADVI
    approx_1 = pm.fit(method="advi", n=10000)

# Extract log-probability function for the first posterior
logp_first_posterior = approx_1.logp_norm
```
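For context on what that extracted quantity represents: the mean-field ADVI posterior is a diagonal Gaussian, so its log-density can be written out explicitly. Below is a minimal NumPy sketch of that density; `mu_1` and `sd_1` are placeholders for the fitted values (which, if I read the API correctly, can be pulled out via `approx_1.mean.eval()` and `approx_1.std.eval()`):

```python
import numpy as np

# Placeholder mean-field parameters; in practice these would come from the
# fitted approximation (e.g. approx_1.mean.eval() / approx_1.std.eval())
mu_1 = np.zeros(5)
sd_1 = np.ones(5)

def logp_first_posterior_np(param):
    """Log-density of a diagonal-Gaussian (mean-field) posterior at `param`."""
    return np.sum(
        -0.5 * np.log(2 * np.pi) - np.log(sd_1)
        - 0.5 * ((param - mu_1) / sd_1) ** 2
    )

print(logp_first_posterior_np(np.zeros(5)))  # 5 * log N(0 | 0, 1) ≈ -4.5947
```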
Next, I want to fit a model with the same parameters on a second sample, using a KL-divergence penalty to keep the posterior of that parameter on the second sample from differing too much from the posterior I got from the first sample. My code looks like this:
```python
# Simulated small sample, 100 data points
sample_2 = np.random.randn(100)

# --- STEP 2: Fit ADVI on the second dataset with KL penalty ---
with pm.Model() as model_2:
    param = pm.Normal("param", mu=0, sigma=1, shape=5)
    obs = pm.Normal("obs", mu=param.sum(), sigma=1, observed=sample_2)

    # Compute KL divergence
    log_p_first = logp_first_posterior({"param": param})

    # Problematic line of code follows: this is the *prior* logp of param,
    # not the posterior logp under the second model
    log_p_second = pm.logp(pm.Normal.dist(0, 1), param).sum()

    kl_div = (log_p_first - log_p_second).sum()

    # Add potential to penalize divergence
    pm.Potential("kl_penalty", -kl_div)

    # Fit the second model
    approx_2 = pm.fit(method="advi", n=10000)
```
Now, this works well (as do some other approaches and ideas I have tried), but it does not completely capture my idea. As you can see at the "problematic line of code", instead of calculating the posterior logp of param under the second model and comparing it with the posterior logp derived from the first sample, I am calculating the prior logp of param and comparing that with the posterior logp derived from the first sample. Is there a way to somehow use the posterior logp here as well?
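One direction I have considered: since both ADVI posteriors are diagonal Gaussians, the KL divergence I am after has a closed form that could be evaluated from the fitted means and standard deviations of the two approximations, rather than symbolically inside the model. A minimal sketch; the `mu_*`/`sd_*` values here are stand-ins, not fitted results:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sd_q, mu_p, sd_p):
    """Closed-form KL(q || p) between two diagonal Gaussians, summed over dims."""
    var_q, var_p = sd_q ** 2, sd_p ** 2
    return np.sum(
        np.log(sd_p / sd_q) + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5
    )

# Placeholder parameters standing in for the two fitted posteriors
mu_q, sd_q = np.zeros(5), np.ones(5)  # posterior on the small sample
mu_p, sd_p = np.zeros(5), np.ones(5)  # posterior on the large sample

print(kl_diag_gaussians(mu_q, sd_q, mu_p, sd_p))  # identical Gaussians -> 0.0
```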
Another thing that I was interested in was your opinion of the approach (penalizing the likelihood with the KL divergence between the two posteriors in order to try to get similar posteriors from the second model). Do you think there is a better way to do this?