Speeding up Bayesian Modelling - HelloFresh blogpost

Jovan · September 10, 2022, 10:44pm

Hi all!

Recently I was reading this article from the PyMC Labs blog.

In short: the article discusses how the PyMC team achieved massive speedups when doing Bayesian modeling for the purpose of A/B tests. According to the article, the main trick was to use one unpooled model and to fit all of the datasets simultaneously, instead of doing inference on many datasets sequentially.
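To make "one unpooled model over all datasets" concrete, here is a minimal NumPy sketch. It is not the code from the article and it does not use PyMC; the dataset sizes, parameter values and the helper `normal_logpdf` are made up for illustration. It stacks several datasets into one long array with a group index and checks that the joint log-likelihood, computed in a single vectorized call, equals the sum of the per-dataset log-likelihoods.

```python
import numpy as np

def normal_logpdf(y, mu, sigma):
    """Elementwise log-density of Normal(mu, sigma) evaluated at y."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

rng = np.random.default_rng(0)

# Hypothetical fake data: three datasets of different sizes.
sizes = [50, 80, 120]
true_mu = [0.0, 1.0, -0.5]
true_sigma = [1.0, 2.0, 0.5]
datasets = [rng.normal(m, s, size=n) for m, s, n in zip(true_mu, true_sigma, sizes)]

# Long format: one flat array of observations, plus an index that says
# which dataset each observation belongs to.
y = np.concatenate(datasets)
idx = np.repeat(np.arange(len(datasets)), sizes)

# One (mu, sigma) pair per dataset and nothing shared between them:
# that is what makes the model "unpooled".
mu = np.array([0.1, 0.9, -0.4])
sigma = np.array([1.1, 1.9, 0.6])

# Joint log-likelihood of the unpooled model in one vectorized call.
joint = normal_logpdf(y, mu[idx], sigma[idx]).sum()

# The same quantity computed dataset by dataset.
separate = sum(normal_logpdf(d, m, s).sum() for d, m, s in zip(datasets, mu, sigma))

print(np.isclose(joint, separate))
```

Since the two quantities agree, the unpooled model describes the same inference problem as the separate models. Any speed difference therefore has to come from how the work is carried out (for example per-model setup cost versus per-evaluation cost), not from the statistics.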

My question is: I don’t quite understand where the performance speedup is coming from. Given that the context is A/B tests, I would naively assume that each dataset has anywhere between 100_000 and 1_000_000 samples. If you combine, say, 10 or so such datasets, wouldn’t that make the sampling considerably slower than just sampling each dataset independently?

I did play around with a toy example that I quickly put together using generated fake data, and it looked like modeling the datasets sequentially is quite a bit faster than fitting one unpooled model, at least on my (pretty standard) laptop. Happy to share code if needed, but the model is pretty standard: a Gaussian likelihood with a Gaussian prior on the mean and a Uniform prior on the standard deviation, nothing fancy.
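For reference, a toy setup like the one described above can be sketched in plain NumPy as an unnormalized log-posterior for a single dataset. This is only an illustration, not the actual code: the function name `log_posterior`, the prior hyperparameters (`prior_mean`, `prior_sd`, `sigma_max`) and the fake data are all assumptions.

```python
import numpy as np

def log_posterior(mu, sigma, y, prior_mean=0.0, prior_sd=10.0, sigma_max=10.0):
    """Unnormalized log-posterior of (mu, sigma), up to an additive constant.

    Assumed model: y ~ Normal(mu, sigma), mu ~ Normal(prior_mean, prior_sd),
    sigma ~ Uniform(0, sigma_max).
    """
    # The Uniform prior gives zero density outside its support.
    if not (0.0 < sigma < sigma_max):
        return -np.inf
    # Gaussian prior on the mean (constants dropped).
    log_prior_mu = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    # Inside its support the Uniform prior only adds a constant, so it is omitted.
    # Gaussian log-likelihood (constants dropped); cost grows linearly with y.size.
    log_lik = -y.size * np.log(sigma) - 0.5 * np.sum(((y - mu) / sigma) ** 2)
    return log_prior_mu + log_lik

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, size=1000)  # hypothetical fake data

near = log_posterior(2.0, 1.5, y)      # at the generating values
far = log_posterior(-3.0, 1.5, y)      # mean far from the data
outside = log_posterior(2.0, 20.0, y)  # sigma outside the Uniform support
```

Each evaluation touches every observation once, so the per-evaluation cost scales with the dataset size. That is why the number of observations per dataset matters so much for this question.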

What am I missing here? Is it my limited knowledge and understanding of the PyMC library, or perhaps something more fundamental about Bayesian inference?

I understand that the article in question describes paid work by the authors, so it is totally cool if no more info on this can be shared. I was just curious.

Also - big thanks to the authors of the blogs, I very much enjoy reading them and really like the language and the style used.

Cheers!

drbenvincent · September 11, 2022, 1:47pm

Hi @Jovan, glad you liked the blog post.

Not quite sure if this will answer your question, but we can imagine two ends of a spectrum.

On one end we have many A/B tests each with modest numbers of observations. This was pretty much the situation in the blog post you mentioned.
On the other end we have one (or a few) A/B tests, each with many observations. We also addressed this for a different (unnamed) client, written up here: Bayesian inference at scale: Running A/B tests with millions of observations - PyMC Labs

So depending where we are on this spectrum, different approaches produce different speed-ups.

Not sure if that helps? Feel free to DM me if you think an engagement with PyMC Labs might help with your particular problem.

Jovan · September 11, 2022, 8:11pm

Hi @drbenvincent

Thank you for the prompt reply!

Yes, I think your reply cleared things up for me: my assumption about the data size was wrong :). I (naively) thought that combining the multiple datasets from the HelloFresh blog post would get one well into the territory of the second blog post that you mentioned, hence my confusion.

Yeah sure, I will keep that in mind. I would love to support this project if I can!
