# Compute the KL divergence between two distributions

I would like to compute the Kullback-Leibler divergence between two distributions in PyMC3. I believe that the `logp` methods give me most of what I need, but I could use some help with the PyMC3 innards:

1. How do I get the `logp` value for a sample (assignment of values to variables), as opposed to just a single var-value assignment? Is it as simple as invoking the `logp` method on all the variables and then summing? Is there a way to use a `Deterministic` random variable to get me the `logp`?

I see this in the discussion of the `Model` class:

```
logp_nojact
    Theano scalar of log-probability of the model

logpt
    Theano scalar of log-probability of the model
```
I'm not sure why there are two names with the same description, or whether (and how) I could use one of these to get a value. Could I assign `model.logpt` to a `Deterministic` variable?

1. Is there a way to transfer an assignment of values to variables from one model to another? That is, if I could compute the logp for samples in a trace generated from model P_1 along the lines described above, could I somehow transfer the samples from P_1 to P_2 and then get the logp relative to P_2?

Thanks! I suspect that the answer is blindingly obvious to anyone with a better understanding of PyMC3, but I don't see it myself.

I think I have at least partial answers to my questions, for those who have the same ones:

1. The `logp` methods operate on points, so we can do things like this:
```python
logps = [model.logp(posterior_trace.point(i)).item() for i in range(X)]
```
1. For this I was interested in comparing results from a prior and a posterior model. So if I have a model constructor that optionally takes observations as arguments, I believe I can do the following:
a. Create a prior model, `m1`
b. Create a posterior model `m2` that is like `m1`, but has some observed variables.
c. AFAICT the points from a trace generated by either `m1` or `m2` can be evaluated using the `logp` method of either model. Indeed, if I understand @junpenglao 's argument elsewhere, the `logp` of these two models will be identical, and the only difference between the two will be in the normalizing constant (the `logp` being in general an un-normalized potential, instead of a probability).
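The claim in (c) can be illustrated outside PyMC3 with a minimal sketch (a hypothetical one-level model with a standard-normal prior and unit-variance likelihood; values are made up): whether `y` is a free variable (prior model) or observed (posterior model), the joint un-normalized log density of a full assignment is the same sum of terms.

```python
import numpy as np
from scipy import stats

# a full assignment: parameter mu and data y (hypothetical values)
mu, y = 0.5, np.array([0.2, 0.7])

# The prior model treats y as a free variable; the posterior model
# observes y. Either way, the joint un-normalized log density of the
# full assignment is the same sum:
logp = stats.norm(0, 1).logpdf(mu) + stats.norm(mu, 1).logpdf(y).sum()
print(logp)
```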

Follow-up question: as the above suggests, I am able to apply `logp` to elements of my traces. But… I get back `logp` values that are very, very negative (between -16,000 and -2 * 10^144!).

I was wondering if you had any suggestions about how I might go about rescaling these numbers so that I can deal with them.

Hmmm, it seems there might be a problem in your model. Could it be that the VI approximation is poor, so that when you evaluate the logp of the samples it gives you very negative values?

By “VI” you mean “variational inference,” correct?

I'm not sure what exactly is going wrong here, because I'm testing the `logp` of samples generated using `sample_prior_predictive`, so there isn't a VI stage here, the way there is when I use `sample` with NUTS, is there?

I don't know how to check the internals of the model to explain this, except to say that it is a three-level hierarchical model, because it involves measurements of populations, and that the population measures (per my other question) are sums of independent Gaussians. I'll see about putting the model into a notebook to share.

Is there any chance that the `logp` simply decays as the number of variables in the model grows, and that beyond a certain number of variables we need to rescale?

Is there any tracing mechanism that would let me take the `logp` calculation for a single point and print out all of the factors? That would show us if it's decaying with size as I suggest.
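A quick NumPy sketch (standard-normal data, not the actual model) shows that the total `logp` does grow linearly with the number of observations, so dividing by the number of terms gives a roughly size-independent number:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    x = rng.normal(size=n)                    # n iid standard-normal draws
    total = stats.norm(0, 1).logpdf(x).sum()  # grows more negative with n
    per_obs = total / n                       # roughly constant across n
    print(n, round(total, 1), round(per_obs, 2))
```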

Thanks for any suggestions!

Do I understand correctly that there should be a scaling factor applied to the `logp`, presumably to keep it from getting too small? If so, could it be that my model's scaling factor is not computed correctly, so it is too small? I am digging around, and not sure yet, but it looks like the `logp` factors that get large are the ones corresponding to the output variables, which have a high number of observations. Could that be it?

Hi, @junpenglao, I wonder if you can help us. We're trying to compute the logp for a distribution and, as reported, get very large negative values.
As we look at the `logp` values from our model on a per-variable basis, we see that, as we reported earlier, there are some very large terms.
Looking at an example, we see this, which does seem to suggest a low probability sample:
Sampled hyper-parameters:

```python
gamma_mu = 0.02099537805606402
gamma_sigma = 1.179636714844027e-06
```

which suggests that any sample from the `gamma` variable should be very close to 0.02. When we look at the sampled value, though, it's pretty far:

```python
gamma = 0.12212544319233962
```

Which is roughly 85,700 standard deviations from the mean!!!
So it's clear why this `logp` value is so hugely negative.
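The distance of that draw from the hyper-parameters is simple arithmetic, recomputing the z-score from the sampled values quoted above:

```python
gamma_mu = 0.02099537805606402
gamma_sigma = 1.179636714844027e-06
gamma = 0.12212544319233962

# how many standard deviations the sampled gamma sits from its mean
z = (gamma - gamma_mu) / gamma_sigma
print(z)  # roughly 8.6e4
```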

But I'm boggled by this sample, which was generated by `sample_prior_predictive`. It's wildly unlikely. For that matter, the `gamma_sigma` value is quite unlikely also, I think. The mean of the `gamma_sigma` is 0.15 and the standard deviation is 0.05.

So I guess Iâ€™m wondering what is going on inside `sample_prior_predictive` and if it could possibly have an issue with our distribution.

When I use the conventional `sample` function on our model, it gives what looks like a reasonable sample. But… when I do a short chain of 5 samples, I get back 5 identical samples, which also seems buggy – we have no evidence, and we have an enormous probability space (three layers of parameters: hyper-hyper-parameters, hyper-parameters, and parameters). So shouldn't the sampler be hopping around?

Hmmm, using these parameters you will get a pretty weird prior where there is low probability almost everywhere. As:

```python
mu = 0.02099537805606402
sd = 1.179636714844027e-06
alpha = mu**2 / sd**2
beta = mu / sd**2

alpha, beta
# (316774953.93371433, 15087842337.862612)
```

You should check your Gamma prior and use the alpha/beta parameterization.
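For what it's worth, the moment-matching between a mean/sd specification and the alpha/beta parameterization can be sanity-checked with SciPy (using the 0.15 mean and 0.05 standard deviation mentioned above for `gamma_sigma`):

```python
from scipy import stats

# moment-match a Gamma to a desired mean and standard deviation
mu, sd = 0.15, 0.05
alpha = (mu / sd) ** 2   # shape
beta = mu / sd ** 2      # rate; SciPy takes scale = 1/beta
dist = stats.gamma(a=alpha, scale=1.0 / beta)
print(dist.mean(), dist.std())  # recovers 0.15 and 0.05
```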

As for the identical samples: you can hardly tell anything from 5 samples. Usually after tuning (i.e., the sampler is in the typical set and the step size etc. are well tuned), and if your problem is well posed (no high-curvature regions), samples are almost always different using NUTS. But if either of the above is not satisfied, you will see these kinds of stuck chains.

Thank you very much for this advice. We found the bug in the model and now it is working very nicely.

What we would like to do is compute the KL divergence between the prior and posterior model, but only for a subset of the variables. I think we almost have the answer, doing this:

```python
logps = []
for point in trained_trace:
    point_logp = 0.0
    for variable in interesting_variables:
        point_logp += variable.logp(point)
    logps.append(point_logp)
return np.mean(logps)
```

Our problem is defining `interesting_variables` – I tried `named_vars`, removing the transformed variables, and then removing the ones that weren't interesting. But this led to problems, because the model's `logp` function delegates computation of some terms to the transformed variables. So I think the above loop needs something that removes the transformed variables, and from there delegates computation to the right transformed ones. Are those transformed ones only the `_log__` ones, or do we also need to do something with the `Bound` ones?

Yes, after sampling, you pass each slice of the trace to the model logp:

```python
logp = model.logp
lp = []
for point in trace:
    lp.append(logp(point))
```

Note that prior samples are in a dict of arrays, so you need to slice it manually.
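A sketch of that manual slicing, with hypothetical variable names (`sample_prior_predictive` returns a dict mapping each variable name to an array of draws):

```python
import numpy as np

# stand-in for the dict returned by sample_prior_predictive
# (variable names and values are hypothetical)
prior = {'mu': np.array([0.1, 0.2, 0.3]),
         'sigma': np.array([1.0, 1.1, 1.2])}

# slice the dict of arrays into one point (dict of scalars) per draw
n_draws = len(next(iter(prior.values())))
points = [{name: draws[i] for name, draws in prior.items()}
          for i in range(n_draws)]
print(points[1])  # {'mu': 0.2, 'sigma': 1.1}
```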

If we want to do this variable-wise as in my example code (sorry for messing it up), I think we need to do something like check and see if `variable.logp` has a transformed variable and, if so, get the logp from that variable instead.

Sorry if this is a dumb question, but I'm having trouble finding the real `logp` code, because there are the multiple layers of optimization needed to make it work efficiently.

Hereâ€™s some code we were using to try to figure this out:

```python
def display_factored_point_logp(m: SimpleModel,
                                point: Dict[str, np.ndarray]) -> None:
    '''Display the by-variable factors of the logp function of point.'''
    with m.model:
        good_varnames = filtered_vars(m)
        for (name, RV) in m.named_vars.items():
            if name in good_varnames:
                try:
                    lp = RV.logp(point)
                    print(name, lp)
                except AttributeError:
                    print('{0} has no logp value.'.format(name))
    return
```

The problem is that we would get a lot of these “has no logp value” errors. I think the right way to handle these is to find a transformed variable for the logp, but I'm not sure how to do it.

You should evaluate the logp on the transformed space. First cache the logp functions: `logp_funcs = [var.logp for var in m.free_RVs]`; then, in the for loop over the trace, run an inner for loop over `logp_funcs` and evaluate each free RV's logp at the point.
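A language-level sketch of that double loop, with stand-in SciPy callables in place of the real `var.logp` functions (the variable names, including the `_log__` suffix for a transformed variable, are hypothetical):

```python
import numpy as np
from scipy import stats

# stand-ins for [var.logp for var in m.free_RVs], keyed by name;
# in PyMC3 these would be the cached per-variable logp callables
logp_funcs = {
    'mu': lambda pt: stats.norm(0, 10).logpdf(pt['mu']),
    'sigma_log__': lambda pt: stats.norm(0, 1).logpdf(pt['sigma_log__']),
}

# a toy trace of points on the transformed space
trace = [{'mu': 0.3, 'sigma_log__': -0.1},
         {'mu': -0.2, 'sigma_log__': 0.4}]

# outer loop over points, inner loop over the cached logp functions
per_point = [sum(f(pt) for f in logp_funcs.values()) for pt in trace]
print(per_point)
```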