Any interest in coresets / data summarization techniques?

I just watched a great talk by Tamara Broderick: Automated Scalable Bayesian Inference via Data Summarization. In the first part of the talk she explains variational Bayes (and its limitations), and in the second she explains how, if the dataset can first be reduced in size using summarization techniques, traditional MCMC can then be used on big-data problems.
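If I understand the core idea correctly, a Bayesian coreset replaces the full-data log-likelihood with a sparse weighted sum over a small subset of the data, so MCMC only ever has to touch the subset:

$$\sum_{n=1}^{N} \log p(x_n \mid \theta) \;\approx\; \sum_{m \in \mathcal{C}} w_m \log p(x_m \mid \theta), \qquad |\mathcal{C}| \ll N$$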

Here is a paper she references on the latest summarization technique from her research group: Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent. There is even a GitHub repo.
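I have not tried the repo yet, but as far as I can tell, once you have coreset points and weights you could plug them into pymc3 with a weighted log-likelihood via `pm.Potential`. Here is a minimal sketch with made-up placeholder points and weights and a toy normal model (my own illustration, not her actual method):

```python
import numpy as np
import pymc3 as pm

# Placeholder coreset: in practice these would come from a coreset
# construction algorithm such as the one in the linked repo.
coreset_points = np.random.randn(50)           # the retained data points x_m
coreset_weights = 100.0 * np.random.rand(50)   # their learned weights w_m

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0.0, sd=10.0)
    sigma = pm.HalfNormal('sigma', sd=5.0)
    # Weighted log-likelihood sum_m w_m * log p(x_m | mu, sigma),
    # standing in for the full-data likelihood.
    point_logp = pm.Normal.dist(mu=mu, sd=sigma).logp(coreset_points)
    pm.Potential('coreset_loglike', (coreset_weights * point_logp).sum())
    # Ordinary NUTS on the small weighted problem.
    trace = pm.sample(1000, tune=1000)
```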

What are your thoughts on this approach? Is there interest in these techniques or are they too new / unproven?

I have been using pymc3 on some relatively small datasets and am now getting started on some much larger ones, so I have been surveying the literature for the best techniques. ADVI seems to be the state of the art, and minibatch ADVI in particular seems to scale well to huge datasets. However, the mean-field (and full-rank) ADVI approximations are not easy to improve upon (they either work or they do not), while data summarization techniques have a natural tuning knob (the coreset size, i.e. the number of retained data points) that can be used to trade computation for accuracy.
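For reference, this is the kind of minibatch ADVI setup I have been experimenting with (a toy normal model on fake data; the names and sizes here are just for illustration):

```python
import numpy as np
import pymc3 as pm

# Fake "large" dataset just for illustration.
N = 100_000
data = np.random.randn(N) * 2.0 + 1.0

# Stream random minibatches instead of the full array.
batch = pm.Minibatch(data, batch_size=500)

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0.0, sd=10.0)
    sigma = pm.HalfNormal('sigma', sd=5.0)
    # total_size rescales the minibatch likelihood to the full dataset.
    pm.Normal('obs', mu=mu, sd=sigma, observed=batch, total_size=N)
    # Mean-field ADVI; method='fullrank_advi' would fit the full-rank version.
    approx = pm.fit(n=20_000, method='advi')

trace = approx.sample(1000)
```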

Never heard of this, but it looks interesting at first glance. There is definitely interest in this, so I’d encourage you to try it out and, if you find that it works well, work on a PR to PyMC3.