Speeding up Bayesian Modelling - HelloFresh blogpost

Hi all!

Recently I was reading this article from the PyMC Labs blog.

In short: the article discusses how the PyMC team achieved massive speedups when doing Bayesian modeling for the purpose of A/B tests. According to the article, the main trick was to use one unpooled model and to fit all of the datasets simultaneously, instead of doing inference on many datasets sequentially.

My question is: I don’t quite understand where the performance speedup is coming from. Given that the context is A/B tests, I would naively assume that each dataset has anywhere between 100_000 and 1_000_000 samples. Combining, say, 10 or so such datasets… wouldn’t that make the sampling considerably slower than just sampling each dataset independently?

I did play around with a toy example I quickly put together with some generated fake data, and it looked like modeling the datasets sequentially is quite a bit faster than fitting one unpooled model, at least on my (pretty standard) laptop. Happy to share code if needed. The model itself is pretty standard: a Gaussian distribution with priors on the mean (Gaussian) and stdev (Uniform), nothing fancy.
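For concreteness, here is a minimal sketch of the two setups being compared, written in plain NumPy rather than PyMC. It only evaluates the Gaussian log-likelihood and does not do any sampling. The function names, dataset sizes and parameter values are illustrative assumptions, not taken from the article or from the toy example above.

```python
import numpy as np


def gaussian_loglik(y, mu, sigma):
    """Elementwise Gaussian log-density of y given mean mu and stdev sigma."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2


def sequential_loglik(datasets, mus, sigmas):
    """One model per dataset: evaluate each likelihood in a Python loop."""
    return np.array(
        [gaussian_loglik(y, m, s).sum() for y, m, s in zip(datasets, mus, sigmas)]
    )


def unpooled_loglik(y_all, idx, mus, sigmas):
    """One unpooled model: each dataset keeps its own mu and sigma,
    selected per observation through the integer index `idx`."""
    per_obs = gaussian_loglik(y_all, mus[idx], sigmas[idx])
    # Sum the per-observation terms back into one value per dataset.
    return np.bincount(idx, weights=per_obs, minlength=len(mus))


rng = np.random.default_rng(0)
n_datasets, n_obs = 10, 1_000  # hypothetical sizes, chosen only for illustration

# Fake data: one mean and one stdev per dataset.
mus = rng.normal(0.0, 1.0, size=n_datasets)
sigmas = rng.uniform(0.5, 2.0, size=n_datasets)
datasets = [rng.normal(m, s, size=n_obs) for m, s in zip(mus, sigmas)]

# Stack everything into one long vector plus a dataset index.
y_all = np.concatenate(datasets)
idx = np.repeat(np.arange(n_datasets), n_obs)

seq = sequential_loglik(datasets, mus, sigmas)
joint = unpooled_loglik(y_all, idx, mus, sigmas)
```

Because no parameters are shared between datasets, the two routes give the same per-dataset log-likelihoods, so the difference between them should be purely computational: the sequential route pays any fixed per-model cost once per dataset, while the unpooled route pays it once but touches every observation on each evaluation.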

What am I missing here? Is it my limited knowledge & understanding of the PyMC library, or perhaps something more fundamental about Bayesian inference?

I understand that the article in question described paid work by the authors so it is totally cool if no more info on this can be shared - I was just curious :slight_smile:

Also - big thanks to the authors of the blogs, I very much enjoy reading them and really like the language and the style used.

Cheers!

Hi @Jovan, glad you liked the blog post.

Not quite sure if this will answer your question, but we can imagine two ends of a spectrum.

So depending where we are on this spectrum, different approaches produce different speed-ups.

Not sure if that helps? Feel free to DM me if you think an engagement with PyMC Labs might help for your particular problem :slightly_smiling_face:

Hi @drbenvincent

Thank you for the prompt reply!

Yes, I think your reply cleared things up for me: my assumption about the data size was wrong :). I (naively) thought that combining the multiple datasets from the HelloFresh blog post would get one well into the territory of the 2nd blog post that you mentioned, hence my confusion.

Yeah sure, I will keep that in mind. I would love to support this project if I can!