I guess this effect can be explained as follows:
- you say your posterior has two modes
- the approximation actually samples from only one of them
- in the KL divergence both modes are covered
- so it does not matter which mode you end up sampling from
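To make the points above concrete, here is a minimal sketch in plain Python. It assumes a single-Gaussian approximation q fitted by minimizing KL(q || p), which is my reading of the setup and not something stated in the thread. The bimodal target (modes at -3 and 3) and the helper names `normal_pdf`, `target_pdf` and `kl_q_p` are made up for illustration.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def target_pdf(x):
    """Hypothetical bimodal posterior: equal mixture of N(-3, 1) and N(3, 1)."""
    return 0.5 * normal_pdf(x, -3.0) + 0.5 * normal_pdf(x, 3.0)

def kl_q_p(mu, lo=-12.0, hi=12.0, steps=2400):
    """KL(q || p) for q = N(mu, 1), approximated by a Riemann sum on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        q = normal_pdf(x, mu)
        total += q * math.log(q / target_pdf(x)) * dx
    return total

kl_left = kl_q_p(-3.0)   # q sits on the left mode
kl_mid = kl_q_p(0.0)     # q sits between the modes
kl_right = kl_q_p(3.0)   # q sits on the right mode

print("KL at left mode: ", kl_left)
print("KL between modes:", kl_mid)
print("KL at right mode:", kl_right)
```

Under these assumptions the two single-mode fits get the same KL value, and both beat the fit centred between the modes, so the optimizer has no reason to prefer one mode over the other.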
Regarding the minibatch dataset, did you specify total_size for your observed variable?
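The reason total_size matters, as far as I understand it, is that the minibatch log-likelihood has to be rescaled by N / batch_size so that it stands in for the full dataset. Without that, the prior is weighted too heavily against the data. Below is a rough sketch of that arithmetic in plain Python, not the library's actual code. The dataset, the parameter values and the helper `normal_logpdf` are hypothetical.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

# Hypothetical dataset of N = 100 points, split into minibatches of 10.
data = [0.1 * i for i in range(100)]
batch_size = 10
mu, sigma = 5.0, 2.0  # some fixed parameter values to evaluate at

# Log-likelihood of the full dataset.
full_loglik = sum(normal_logpdf(x, mu, sigma) for x in data)

# Per-minibatch log-likelihoods, without and with the N / batch_size scaling.
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
scale = len(data) / batch_size
unscaled = [sum(normal_logpdf(x, mu, sigma) for x in b) for b in batches]
scaled = [scale * value for value in unscaled]

mean_unscaled = sum(unscaled) / len(unscaled)
mean_scaled = sum(scaled) / len(scaled)

print("full log-likelihood:      ", full_loglik)
print("mean unscaled minibatch:  ", mean_unscaled)
print("mean scaled minibatch:    ", mean_scaled)
```

Averaged over the batches, the scaled value matches the full log-likelihood, while the unscaled one is smaller in magnitude by the factor N / batch_size, which is the kind of mismatch you get when total_size is left out.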