Hi all!
Recently I was reading this article from the PyMC Labs blog.
In short: the article discusses how the PyMC Labs team achieved massive speedups when doing Bayesian modeling for A/B tests. According to the article, the main trick was to use a single unpooled model that fits all of the datasets simultaneously, instead of running inference on each dataset sequentially.
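For concreteness, my reading of "one unpooled model" is something like the PyMC sketch below (my own reconstruction with made-up priors and data sizes, not the article's actual code):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
# Fake data standing in for a handful of A/B test datasets
datasets = [rng.normal(loc=m, scale=1.0, size=1_000) for m in (0.0, 0.1, -0.05)]
y = np.concatenate(datasets)
# Integer label mapping each observation to its dataset
group = np.repeat(np.arange(len(datasets)), [len(d) for d in datasets])

with pm.Model():
    # Independent mean/stdev per dataset -- nothing is pooled,
    # the datasets just share a single NUTS run
    mu = pm.Normal("mu", mu=0.0, sigma=1.0, shape=len(datasets))
    sigma = pm.Uniform("sigma", lower=0.0, upper=5.0, shape=len(datasets))
    pm.Normal("obs", mu=mu[group], sigma=sigma[group], observed=y)
    idata = pm.sample()
```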
My question is: I don't quite understand where the performance speedup is coming from. Given that the context is A/B tests, I would naively assume that each dataset has anywhere between 100_000 and 1_000_000 samples. Combining, say, 10 or so such datasets - wouldn't that make sampling considerably slower than just sampling each dataset independently?
I did play around with a toy example I quickly put together with some generated fake data, and it looked like sequential modeling of the datasets is quite a bit faster than one unpooled model, at least on my (pretty standard) laptop. I've pasted the code below, but it's pretty standard: a Gaussian likelihood with priors on the mean (Gaussian) and standard deviation (Uniform) - nothing fancy.
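Here's roughly the sequential loop I timed (reconstructed from memory, so the exact sizes and priors are approximate; the unpooled counterpart is the sketch above):

```python
import time

import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
datasets = [rng.normal(loc=m, scale=1.0, size=100_000) for m in (0.0, 0.1, -0.05)]

start = time.time()
for data in datasets:
    # One small, fully independent model per dataset, sampled one after another
    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sigma=1.0)
        sigma = pm.Uniform("sigma", lower=0.0, upper=5.0)
        pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
        pm.sample(progressbar=False)
print(f"sequential total: {time.time() - start:.1f}s")
```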
What am I missing here - is it my limited understanding of the PyMC library, or perhaps something more fundamental about Bayesian inference?
I understand that the article in question describes paid work by the authors, so it's totally cool if no more info on this can be shared - I was just curious.
Also - big thanks to the authors of the blog, I very much enjoy reading the posts and really like the language and style.
Cheers!