Compute the KL divergence between two distributions


I would like to compute the Kullback-Leibler divergence between two distributions in PyMC3. I believe that the logp methods give me most of what I need, but I could use some help with the PyMC3 innards:

  1. How do I get the logp value for a sample (assignment of values to variables), as opposed to just a single var-value assignment? Is it as simple as invoking the logp method on all the variables and then summing? Is there a way to use a Deterministic random variable to get me the logp?

I see this in the discussion of the Model class:

Theano scalar of log-probability of the model

Theano scalar of log-probability of the model

I’m not sure why there are two names with the same description, and if or how I could use one of these methods to get a value. Could I assign model.logpt to a Deterministic variable?

  2. Is there a way to transfer an assignment of values to variables from one model to another? That is, if I could compute the logp for samples in a trace generated from model P_1 along the lines described above, could I somehow transfer the samples from P_1 to P_2 and then get the logp relative to P_2?

Thanks! I suspect that the answer is blindingly obvious to anyone with a better understanding of PyMC3, but I don’t see it myself.


I think I have at least partial answers to my questions, for those who have the same ones:

  1. The logp methods operate on points, so we can do things like this:
    [model.logp(posterior_trace.point(i)).item() for i in range(X)]
  2. For this I was interested in comparing results from a prior and a posterior model. So if I have a model constructor that optionally takes observations as arguments, I believe I can do the following:
    a. Create a prior model, m1
    b. Create a posterior model m2 that is like m1, but has some observed variables.
    c. AFAICT the points from a trace generated by either m1 or m2 can be evaluated using the logp method of either model. Indeed, if I understand @junpenglao 's argument elsewhere, the logp of these two models will be identical, and the only difference between the two will be in the normalizing constant (the logp being in general an un-normalized potential, instead of a probability).
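For concreteness, here is a sketch of the Monte Carlo KL estimate this enables, using scipy normals as hypothetical stand-ins for the two models' logp functions (the distributions and sample count are illustrative, not from my actual model): KL(P1 || P2) is approximated by averaging the logp difference over samples drawn from P1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Draw samples from P1 = N(0, 1) and score them under both
# P1 and P2 = N(1, 1); these play the role of trace points
# scored by two models' logp functions.
samples = rng.normal(0.0, 1.0, size=200_000)
logp1 = stats.norm.logpdf(samples, 0.0, 1.0)
logp2 = stats.norm.logpdf(samples, 1.0, 1.0)

# Monte Carlo estimate of KL(P1 || P2) = E_P1[log p1(x) - log p2(x)].
kl_estimate = np.mean(logp1 - logp2)
# Closed form for two unit-variance normals: (mu1 - mu2)^2 / 2 = 0.5
```

With PyMC3 traces, the same pattern would average model-level logp differences over trace points.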


Followup question: as the above suggests, I am able to apply the logp to elements of my traces. But… I get back logp values that are very large in magnitude (between -16,000 and -2 × 10^144)!

I was wondering if you had any suggestions about how I might go about rescaling these numbers so that I can deal with them.


Hmmm, it seems there might be a problem in your model. Could it be that the VI approximation is poor, so that when you evaluate the logp of the samples it gives you very small values?


By “VI” you mean “variational inference,” correct?

I’m not sure what exactly is going wrong here, because I’m testing the logp of samples generated using sample_prior_predictive, so there isn’t a VI stage here, the way there is when I use sample with NUTS, is there?

I don’t know how to check the internals of the model to explain this, except to say that it is a three-level hierarchical model, because it involves measurements of populations, and that the population measures (per my other question) are the sum of independent Gaussians. I’ll see about putting the model into a notebook to share.

Is there any chance that the logp simply decays as the number of variables in the model grows, and that beyond a certain number of variables we need to rescale?

Is there any tracing mechanism that would let me take the logp calculation for a single point and print out all of the factors? That would show us if it’s decaying with size as I suggest.

Thanks for any suggestions!


Do I understand correctly that there should be a scaling factor applied to the logp, presumably to keep it from getting too small? If so, could it be that my model’s scaling factor is not computed correctly, so it is too small? I am digging around, and not sure yet, but it looks like the logp factors that get large are the ones corresponding to the output variables, which have a high number of observations. Could that be it?
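To check my intuition about scale, here is a small numpy sketch (illustrative numbers, not our actual model): the joint log-likelihood of n iid observations is a sum of n per-observation terms, so its magnitude grows linearly with the number of observations — no normalization keeps it small.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=10_000)

# The joint log-likelihood is the sum of per-observation terms...
total = stats.norm.logpdf(data, 0.0, 1.0).sum()

# ...so its magnitude grows linearly with the number of observations,
# while the per-observation average stays modest.
per_obs = total / len(data)
```

So variables with many observations dominating the logp is expected behavior, not necessarily a bug.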


Hi, @junpenglao, I wonder if you can help us. We’re trying to compute the logp for a distribution and, as reported, get very large negative values.
As we look at the logp values from our model on a per-variable basis, we see that, as we reported earlier, there are some very large terms.
Looking at an example, we see this, which does seem to suggest a low probability sample:
Sampled hyper-parameters:

gamma_mu =  0.02099537805606402
gamma_sigma = 1.179636714844027e-06

which suggests that any sample from the gamma variable should be very close to 0.02. When we look at the sampled value, though, it’s pretty far:

gamma =  0.12212544319233962

Which is roughly 85,700 standard deviations from the mean!!!
So it’s clear why this logp value is so large in magnitude.

But I’m boggled by this sample, which was generated by sample_prior_predictive. It’s wildly unlikely. For that matter the gamma_sigma value is quite unlikely also, I think. The mean of the gamma_sigma is 0.15 and the standard deviation is 0.05.
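For a rough sense of scale (treating the prior as if it were Gaussian with this mean and sd, which is only an approximation of our Gamma prior), here is what that sampled value does to the log density:

```python
from scipy import stats

gamma_mu = 0.02099537805606402
gamma_sigma = 1.179636714844027e-06
gamma = 0.12212544319233962

# Standardized distance of the sample from the prior mean:
z = (gamma - gamma_mu) / gamma_sigma   # tens of thousands of sigmas out

# Log density under a Gaussian with the same mean and sd;
# dominated by the -z**2 / 2 term, i.e. billions of nats below zero.
lp = stats.norm.logpdf(gamma, loc=gamma_mu, scale=gamma_sigma)
```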

So I guess I’m wondering what is going on inside sample_prior_predictive and if it could possibly have an issue with our distribution.

When I use the conventional sample function on our model, it gives what looks like a reasonable sample. But… when I do a short chain of 5 samples, I get back 5 identical samples, which also seems buggy – we have no evidence, and we have an enormous probability space (three layers of parameters: hyper-hyper parameters, hyper-parameters, and parameters). So shouldn’t the sampler be hopping around?


Hmmm, using these parameters you will get a pretty weird prior where there is low probability almost everywhere. For example:

mu =  0.02099537805606402
sd = 1.179636714844027e-06
alpha = mu**2 / sd**2
beta = mu / sd**2

alpha, beta

(316774953.93371433, 15087842337.862612)

You should check your Gamma prior and use alpha/beta parameterization.
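If scipy is available, one way to sanity-check this conversion is to build the shape/rate Gamma and confirm it reproduces the requested mean and sd (note scipy parameterizes by shape `a` and `scale = 1/rate`):

```python
from scipy import stats

mu = 0.02099537805606402
sd = 1.179636714844027e-06

alpha = mu**2 / sd**2   # shape
beta = mu / sd**2       # rate

# scipy's Gamma takes shape a and scale = 1 / rate.
dist = stats.gamma(a=alpha, scale=1.0 / beta)
# dist.mean() recovers mu and dist.std() recovers sd.
```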

As for the identical samples, you can hardly tell anything from 5 samples. Usually, after tuning (i.e., the sampler is in the typical set and the step size etc. are well tuned) and if your problem is well posed (no high-curvature regions), samples are almost always different using NUTS. But if either of the above is not satisfied, you will see these kinds of stuck chains.


Thank you very much for this advice. We found the bug in the model and now it is working very nicely.

What we would like to do is compute the KL divergence between the prior and posterior model, but only for a subset of the variables. I think we almost have the answer, doing this:

logps = []
for point in trained_trace:
    point_logp = 0.0
    for variable in interesting_variables:
        point_logp += variable.logp(point)
    logps.append(point_logp)
return np.mean(logps)

Our problem is defining interesting_variables – I tried named_vars, removing the transformed variables, and then removing the ones that weren’t interesting. But this led to problems, because the model’s logp function delegates computation of some terms to the transformed variables. So I think the above loop needs something that removes the transformed variables, and from there delegates computation to the right transformed ones. Are those transformed ones only the _log__ ones, or do we also need to do something with the Bound ones?

Sorry – I accidentally hit the return too soon.


Yes, after sampling, you pass each slice of the trace to the model logp:

logp = model.logp
for point in trace:
    print(logp(point))

Note that prior samples are in a dict of arrays, so you need to slice it manually.
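For example, a sketch of that manual slicing (the dict below is a hypothetical stand-in for what sample_prior_predictive returns — each value is an array with a leading sample dimension):

```python
import numpy as np

# Hypothetical stand-in for a dict of stacked prior-predictive draws:
prior = {
    'mu': np.array([0.1, -0.3, 0.7]),
    'x': np.array([[1.0, 2.0], [0.5, 0.1], [3.0, -1.0]]),
}

def iter_points(samples):
    '''Yield one {name: value} point per draw from a dict of stacked arrays.'''
    n = len(next(iter(samples.values())))
    for i in range(n):
        yield {name: values[i] for name, values in samples.items()}

points = list(iter_points(prior))
# points[0] == {'mu': 0.1, 'x': array([1., 2.])}
```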


If we want to do this variable-wise as in my example code (sorry for messing it up), I think we need to check whether each variable has a transformed counterpart and, if so, get the logp from that transformed variable instead.

Sorry if this is a dumb question, but I’m having trouble finding the real logp code, because there are the multiple layers of optimization needed to make it work efficiently.


Here’s some code we were using to try to figure this out:

def display_factored_point_logp(m: SimpleModel,
                                point: Dict[str, np.ndarray]) -> None:
    '''Display the by-variable factors of the logp function of point.'''
    with m.model:
        good_varnames = filtered_vars(m)
        for (name, RV) in m.named_vars.items():
            if name in good_varnames:
                try:
                    lp = RV.logp(point)
                    print(name, lp)
                except AttributeError:
                    print('{0} has no logp value.'.format(name))

The problem is that we would get a lot of these “has no logp value” errors. I think the right way to handle these is to find a transformed variable for the logp, but I’m not sure how to do it.


You should evaluate the logp on the transformed space. First cache the logp functions: logp_func = [var.logp for var in m.free_RVs]. Then, in the for loop over the trace, run an inner for loop over logp_func and evaluate the point for each of the free_RVs.
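Schematically, the pattern looks like this (the scipy callables below are hypothetical stand-ins for the cached var.logp functions, since the real ones score points on the transformed space — note the _log__ name):

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for [var.logp for var in m.free_RVs];
# real free-RV logps evaluate points on the *transformed* space.
logp_funcs = {
    'mu': lambda pt: stats.norm.logpdf(pt['mu'], 0.0, 1.0),
    'sigma_log__': lambda pt: stats.norm.logpdf(pt['sigma_log__'], 0.0, 1.0),
}

trace = [{'mu': 0.1, 'sigma_log__': -0.5},
         {'mu': -0.3, 'sigma_log__': 0.2}]

# Outer loop over trace points, inner loop over the cached logp callables.
per_point = [sum(f(pt) for f in logp_funcs.values()) for pt in trace]
```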