This is more of a general question than a pymc3-specific one. I'd appreciate it if someone could help me understand this.
I have observed that even though ADVI and MCMC posterior distributions have their mode at the same point, their variances are significantly different. Usually MCMC has a higher variance whereas ADVI has a lower variance.
I know that MCMC draws samples from the exact target distribution, whereas ADVI tries to minimize the KL divergence between a proposed distribution and the target distribution. So is it safe to say that MCMC estimates the variance of the target distribution accurately?
ADVI (the mean-field version) often shows “mode-seeking” behavior, where the estimated posterior sticks to one of the modes of the real posterior. So yes, it isn’t surprising that it estimates a lower variance than MCMC. Now whether MCMC has “accurately” captured variance is a very tough question to answer - you likely have to do posterior predictive checks to get a handle on that.
That being said, doing full-rank ADVI should give you a better estimate of the posterior variance than mean-field.
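To make the mean-field underestimation concrete, here is a small numerical sketch (plain NumPy, not pymc3). For a Gaussian target, the mean-field Gaussian that minimizes KL(q || p) is known in closed form: each marginal variance of q equals the reciprocal of the corresponding diagonal entry of the target's precision matrix, which is smaller than the true marginal variance whenever the dimensions are correlated. The correlation value 0.9 below is just an illustrative choice:

```python
import numpy as np

# Target: a correlated bivariate Gaussian (rho = 0.9, chosen for illustration)
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)  # precision matrix of the target

# Closed-form result for a Gaussian target: the mean-field Gaussian q that
# minimizes KL(q || p) has per-dimension variance 1 / Lambda_ii.
mf_var = 1.0 / np.diag(Lambda)

print("true marginal variances:", np.diag(Sigma))  # [1.0, 1.0]
print("mean-field variances:   ", mf_var)          # [0.19, 0.19]
```

With correlation 0.9, mean-field shrinks each marginal variance from 1.0 to 1 - 0.9^2 = 0.19, while a full-rank Gaussian could represent the target exactly in this toy case.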
It is a well-observed behaviour of ADVI. Quoting Dan Simpson below:
For mean-field Gaussian, your approximating family is a product of Gaussians on the two axes, which, for example, can’t approximate a narrow Gaussian concentrated around the line y=x.
For the full rank one, I’d expect it to be in the correct place, but the covariance matrix to be too “concentrated”. This is because the KL divergence is an asymmetric measure of “distance” between two probability distributions and in the direction that it is used for VI, it penalises approximations that are too diffuse far more fiercely than approximations that are too concentrated. This leads to a systematic underestimation of variation using VB methods.
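That asymmetry is easy to check numerically for one-dimensional Gaussians, where KL(q || p) has a closed form. A quick sketch (the factor of 4 is an arbitrary illustrative choice) comparing a q that is four times too diffuse against one that is four times too concentrated:

```python
import numpy as np

def kl_gauss(var_q, var_p=1.0):
    """KL( N(0, var_q) || N(0, var_p) ) for zero-mean 1-D Gaussians."""
    return 0.5 * (var_q / var_p - 1.0 + np.log(var_p / var_q))

# Same multiplicative error in variance, in opposite directions:
print(kl_gauss(4.0))   # too diffuse: ~0.81
print(kl_gauss(0.25))  # too concentrated: ~0.32
```

The too-diffuse approximation pays a much larger KL penalty than the equally-wrong too-concentrated one, so the VI optimum is biased toward under-dispersed q, exactly the systematic variance underestimation described above.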
Of course, this is not saying that VI is always mode-seeking; for example, see Kevin Murphy’s Machine Learning: A Probabilistic Perspective:
Thanks guys, this is really helpful.