Adding a new operator for an objective which involves gradients

Hello!

I’m writing to ask for advice/suggestions on creating a new operator class for operator VI in pymc3.
Specifically, I’d like to use the squared L2 norm of the difference between the score function of the posterior and the variational approximation:

\mathcal{L}(q,p)= \mathbb{E}_{\theta \sim q(\theta)}[\| \nabla_\theta \log p(\theta)p(x|\theta) - \nabla_\theta \log q(\theta)\|_2^2]

where \theta denotes the free (latent) random variables for which I'd like to fit a variational approximation, and x denotes the observed variables.

This has a similar flavor to the usual ELBO, \mathbb{E}_{\theta \sim q(\theta)}[ \log p(\theta)p(x|\theta) - \log q(\theta)], so I was hoping it would be straightforward to implement (modeling it after the KL operator in variational/operators.py and overriding __init__() and apply() for the new class). However, I've been getting lost trying to keep track of where the relevant variables live and how to properly compute the necessary gradients.
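
For reference, the apply() of the KL operator there is essentially just the following (quoting from memory, so possibly not verbatim):

class KL(Operator):
    def apply(self, f):
        # negative ELBO, built from the normalized log-densities
        return self.logq_norm - self.logp_norm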

More specifically, a first pass was to set self.approx = approx in __init__, and to compute the objective in apply as

import theano.tensor as tt

# score of the (unnormalized) posterior w.r.t. the symbolic draws from q
dlogp = tt.grad(self.approx.datalogp + self.approx.varlogp, self.approx.symbolic_randoms)
# score of the variational approximation
dlogq = tt.grad(self.approx.logq, self.approx.symbolic_randoms)
# squared L2 norm of the score difference, summed over all latent tensors
score_diff = [dlogp_i - dlogq_i for (dlogp_i, dlogq_i) in zip(dlogp, dlogq)]
return sum(tt.sum(tensor ** 2) for tensor in score_diff)

I then tried to use this in essentially the same way that KL is used for ADVI. However, this gives me a "DisconnectedInputError" on the second tt.grad call above, which I think may be due to calling tt.grad on variables from different cloned (sub-)graphs.
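
For reference, here is a minimal pure-Theano example of the kind of disconnection I suspect (just an illustration, not the actual pymc3 graph):

import theano
import theano.tensor as tt

x = tt.vector("x")
cost = tt.sum(x ** 2)
x_clone = x.clone()          # a clone is a distinct symbolic variable
theano.grad(cost, x_clone)   # raises DisconnectedInputError: cost does not depend on x_clone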

Any suggestions on how to proceed with this would be much appreciated, even if just pointers to relevant documentation.

In case it's relevant, I've used Theano in the past but am new to pymc3 (I've mostly used TensorFlow Probability and Stan).

Many thanks in advance!

@ferrine should be able to help you. But it is likely related to the graph cloning that happens when approx.logq is set up.

Also, as a side note, you should replace the two Python for loops with theano.scan for performance; see the generic sketch below.
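
A minimal generic illustration of the idea (a toy example, not specific to your objective):

import theano
import theano.tensor as tt

X = tt.matrix("X")
# one symbolic loop node instead of a Python-level loop over the rows
sq_norms, _ = theano.scan(fn=lambda row: tt.sum(row ** 2), sequences=X)
total = sq_norms.sum()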

Sure, I can help; I will find some time soon to collaborate. BTW, do you plan on opening a PR?

@ferrine Great! I haven't decided yet. Preliminary results (coded on my own) seem to indicate that this may offer some advantages over ADVI with KL, but it's unclear to me at this point whether they warrant sharing.

I also don’t have much sense of how high the bar is for contributing this sort of thing. Do you have thoughts on this?

@junpenglao, thanks for the suggestion to switch over to scan!

-Brian

Did you try calculating \|\nabla_\theta\left[\log p(x|\theta) + \log p(\theta) - \log q(\theta)\right]\|_2^2 instead? The differential operator is linear, and this enables a somewhat more lightweight computation.
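
Spelled out, by linearity of the gradient,

\nabla_\theta \log p(\theta)p(x|\theta) - \nabla_\theta \log q(\theta) = \nabla_\theta \left[ \log p(\theta)p(x|\theta) - \log q(\theta) \right],

so the score difference is a single gradient of one scalar expression; everything stays in one graph and only one tt.grad call is needed.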

PS: I have not coded this yet, but it seems to be a more efficient solution.

  • you'd better use the normalized tensors:
diff_grad = tt.grad(self.logp_norm - self.logq_norm, self.symbolic_randoms)

Thanks for the response! I had not tried this and it seems to work (with the small change of using self.approx.logp_norm / self.approx.logq_norm / self.approx.symbolic_randoms)!
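
For completeness, the apply() body now looks roughly like this (a sketch from my rough notebook, with the surrounding class modeled on KL; tt is theano.tensor, and this is only lightly tested):

def apply(self, f):
    # gradient of the normalized log-density difference w.r.t. the
    # symbolic draws from q; by linearity this equals the score difference
    diff_grad = tt.grad(self.approx.logp_norm - self.approx.logq_norm,
                        self.approx.symbolic_randoms)
    # squared L2 norm of the score difference, summed over all latents
    return sum(tt.sum(g ** 2) for g in diff_grad)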

Much appreciated! I still don’t quite understand why this works, but that is certainly not immediately problematic.

Many thanks again!


Do you have a demo notebook with your study? It would make a cool tutorial on extending OPVI in pymc3.

It would also be interesting to see how it works with normalizing flows: https://github.com/pymc-devs/pymc3/blob/master/docs/source/notebooks/normalizing_flows_overview.ipynb

I do have a rough notebook testing this. It's still at a preliminary stage of evaluation, though, so I'm not sure how worthwhile an extension it is; I'd be glad to share a tidied-up notebook if you're interested!

Is there a reason why you’re interested in seeing how normalising flows work with this operator in particular? I agree though that normalising flows are interesting in general.

Working with normalizing flows a bit, I got the impression that it is quite tricky to make them work very well. The KL operator is mode-seeking, and I have concerns about how this plays out in a setting where both the posterior and the approximating distribution can be multimodal. I have not experimented with the F-divergence, however (lack of time). It seems to me that the operator you are implementing (by the way, is there a link to a corresponding paper?) may have different properties and thus behave differently in this case.

I agree that normalising flows can be difficult to work with.

From preliminary tests a little while back, this approximation appears to be mode-seeking as well.

Sure, the most relevant references from my perspective are:
Huggins, Jonathan H., et al. “Practical bounds on the error of Bayesian posterior approximations: A nonasymptotic approach.” arXiv preprint arXiv:1809.09505 (2018).
and,
Hyvärinen, Aapo. "Estimation of non-normalized statistical models by score matching." Journal of Machine Learning Research 6 (2005): 695–709.

We have not published anything on using this for automated variational inference.