Having worked with normalizing flows a bit, I got the impression that it is quite tricky to make them work well. The KL operator is mode-seeking, and I am concerned about how this plays out in a setting where both the posterior and the approximating distribution can be multimodal. I have not experimented with F-Divergence, however (lack of time). It seems to me that the operator you implement (by the way, is there a link to the corresponding paper?) can have different properties and thus behave differently in this case.
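
To make the mode-seeking concern concrete, here is a minimal sketch in plain NumPy. It is not the operator discussed here and involves no flows. It assumes a toy 1D target (an equal mixture of N(-4, 1) and N(4, 1)) and a single-Gaussian approximation, and uses a brute-force grid search in place of a real optimizer. All names (`normal_pdf`, `kl`, `rev_mu`, and so on) are made up for illustration. It compares the reverse KL(q || p), which is the direction used in standard variational inference, with the forward KL(p || q).

```python
import numpy as np

# Integration grid; wide enough to cover both modes and their tails.
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Assumed toy target: equal mixture of N(-4, 1) and N(4, 1).
p = 0.5 * normal_pdf(x, -4.0, 1.0) + 0.5 * normal_pdf(x, 4.0, 1.0)

def kl(a, b, eps=1e-300):
    # Riemann-sum approximation of KL(a || b) on the grid.
    # eps keeps the logs finite where a density underflows to zero.
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Brute-force search over single-Gaussian candidates q = N(mu, sigma).
mus = np.linspace(-6.0, 6.0, 61)
sigmas = np.linspace(0.5, 6.0, 56)

best_rev = (np.inf, None, None)  # minimizes KL(q || p)
best_fwd = (np.inf, None, None)  # minimizes KL(p || q)
for mu in mus:
    for sigma in sigmas:
        q = normal_pdf(x, mu, sigma)
        rev = kl(q, p)
        fwd = kl(p, q)
        if rev < best_rev[0]:
            best_rev = (rev, mu, sigma)
        if fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)

_, rev_mu, rev_sigma = best_rev
_, fwd_mu, fwd_sigma = best_fwd
print("reverse KL(q||p) fit: mu=%.2f sigma=%.2f" % (rev_mu, rev_sigma))
print("forward KL(p||q) fit: mu=%.2f sigma=%.2f" % (fwd_mu, fwd_sigma))
```

In this toy setup the reverse-KL fit should lock onto one of the two modes with a narrow sigma, while the forward-KL fit should sit between them with a wide sigma that covers both. A flow-based approximation is more flexible than a single Gaussian, so this only illustrates the tendency of the objective, not what a flow would actually do.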