Adding a new operator for an objective which involves gradients


I’m writing to ask for advice/suggestions on creating a new operator class for operator VI in pymc3.
Specifically, I’d like to use the squared L2 norm of the difference between the score function of the posterior and the variational approximation:

\mathcal{L}(q,p)= \mathbb{E}_{\theta \sim q(\theta)}[\| \nabla_\theta \log p(\theta)p(x|\theta) - \nabla_\theta \log q(\theta)\|_2^2]

Where \theta are the free (latent) random variables for which I’d like to fit a variational approximation, and x are observed variables.

This has a similar flavor to the usual ELBO, \mathbb{E}_{\theta \sim q(\theta)}[ \log p(\theta)p(x|\theta) - \log q(\theta)], so I was hoping this would be straightforward to implement [modeling after “KL(Operator)” in variational/, and rewriting “init()” and “apply()” for the new class]. However, I’ve been getting lost trying to keep track of where the relevant variables are and properly computing the necessary gradients.

More specifically, a first pass was to set self.approx=approx in __init__, and in apply compute the objective as

dlogp = tt.grad(self.approx.datalogp+self.varlogp, self.approx.symbolic_randoms)
dlogq = tt.grad(self.approx.logq, self.approx.symbolic_randoms)
score_diff = [dlogp_i - dlogq_i for (dlogp_i, dlogp_i) in zip(dlogp, dlogq)]
return sum(tt.sum(tensor**2) for tensor in score_diff)

I then tried to use this in essentially the same way that KL is used for ADVI. However, this gives me a “DisconnectedInputError” in the second line above, which I think may be due calling tt.grad on variables from different cloned (sub-)graphs.

Any suggestions on how to proceed with this would be much appreciated, even if just pointers to relevant documentation.

In case it’s relevant, I’ve used Theano in past but am new to pymc3 (I mostly have used Tensorflow probability and Stan).

Many thanks in advance!

@ferrine should be able to help you. But likely it is related to clone when you are setting up approx.logq

Also, side note you should replace the two for loop with theano.scan for performance.

Sure I can help. I will find some time soon to collaborate. BTW, do you plan opening a PR?

@ferrine Great! I haven’t decided yet. Preliminary results (coded on my own) seem to indicate that this may offer some advantages over ADVI with KL, but it’s unclear to me at this point if these warrant sharing.

I also don’t have much sense of how high the bar is for contributing this sort of thing. Do you have thoughts on this?

@junpenglao, thanks for the suggestion to switch over to scan!


Did you try to calculate \|\nabla_\theta\left[\log p(D|\theta) + \log p(\theta) - \log q(\theta)\right]\|_2^2 instead? Differential operator is linear and this enables this a bit more lightweight operation.

PS not yet coded this, but it seems to be a more efficient solution

  • you’d better use normed tensors
diff_grad = tt.grad(self.logp_norm  - self.logq_norm, self.symbolic_randoms)

Thanks for the response! I had not tried this and it seems to work (with the small change of using self.approx.logp_norm / self.approx.logq_norm / self.approx.symbolic_randoms)!

Much appreciated! I still don’t quite understand why this works, but that is certainly not immediately problematic.

Many thanks again!

1 Like

Do you have a demo notebook with your study? This would be a cool tutorial to extending OPVI in pymc3

Interesting to see how does it work with Normalizing Flows here:

I do have a rough notebook testing this. Still in preliminary stages of evaluating though, so am not sure how worthwhile an extension it is; I’d be glad to share a tidied up notebook if you’re interested!

Is there a reason why you’re interested in seeing how normalising flows work with this operator in particular? I agree though that normalising flows are interesting in general.

Working with normalizing flows a bit I got an impression that it is quite tricky to make them work very well. KL operator is mode seeking and I have concerns of how does this affect in a setting, where both the posterior and approximate distribution can be multimodal. I did not experiment with F-Divergence, however (lack of time). It seems to me, that the operator you implement (btw is there any link to the corresponding paper?) can have different properties and thus behave differently in this case.

I agree that normalising flows can be difficult to work with.

From preliminary tests a little while back, this approximation appears to be mode-seeking as well.

Sure, the most relevant references from my perspective are:
Huggins, Jonathan H., et al. “Practical bounds on the error of Bayesian posterior approximations: A nonasymptotic approach.” arXiv preprint arXiv:1809.09505 (2018).
Hyvärinen, Aapo. “Estimation of non-normalized statistical models by score matching.” Journal of Machine Learning Research 6.Apr (2005): 695-709.

We have not published anything on using this for automated variational inference.