Hi!
My name is Yelysei, and I’m a Master’s student in Computational Science and Engineering at TU Munich.
I would really like to take Dirichlet Processes to the next level in PyMC this summer.
Some relevant background: for my Bachelor’s thesis I wrote an overview of Bayesian methods in ML, where I also used some models (PPCA, BPCA) from PyMC. I have more than a year of experience implementing custom models in TensorFlow, such as energy-based models and discrete variational autoencoders. In particular, I learned about DPs and HDPs from this paper and was amazed by how cool they are. My GitHub.
I’ve already set up a dev environment and merged a small PR. I read the dev guide carefully, have a reasonable feel for the core infrastructure, and of course went through the DP notebook.
Some thoughts about what can be done:
- perhaps the first thing is to encapsulate the stick-breaking process, something like DirichletProcess from edward, which could also be useful for the upcoming pymc4
- then develop other sampling algorithms that can handle a dynamically growing number of mixture components, as mentioned in the DP notebook (Gibbs sampling, Stochastic Memoization)
- add Hierarchical DPs, as also mentioned in issue #1748
- implement an online/mini-batch HDP, which could be useful for large corpora (the LDA notebook might be relevant)
- add (specialized) variational inference for DP mixtures
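To make the first point a bit more concrete, here is a minimal sketch of the truncated stick-breaking construction that such a wrapper would encapsulate. It is plain NumPy rather than PyMC code, and the names `stick_breaking`, `sample_dp_weights`, and the truncation level `K` are made up for illustration; they are not part of PyMC or Edward.

```python
import numpy as np


def stick_breaking(betas):
    """Map stick-breaking fractions to mixture weights.

    w_k = beta_k * prod_{j < k} (1 - beta_j)
    """
    betas = np.asarray(betas, dtype=float)
    # Stick length left before each break: 1, (1 - b_1), (1 - b_1)(1 - b_2), ...
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining


def sample_dp_weights(alpha, K, rng):
    """Draw K weights from a truncated stick-breaking prior (hypothetical helper).

    alpha is the DP concentration parameter; K is the truncation level,
    so the weights sum to slightly less than 1.
    """
    betas = rng.beta(1.0, alpha, size=K)
    return stick_breaking(betas)


rng = np.random.default_rng(0)
weights = sample_dp_weights(alpha=2.0, K=20, rng=rng)
halves = stick_breaking([0.5, 0.5, 0.5])  # each break takes half of what is left
```

The truncation level is exactly what the second point would try to get rid of: with a fixed `K` the leftover mass `1 - weights.sum()` is simply dropped, whereas a sampler with a growing number of components would only instantiate new sticks when needed.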
What do you think? What would be the best way to proceed? Any comments are appreciated.