Implementing the DirichletMultinomial distribution in PyMC3

Can anyone explain what this one is supposed to test? I can’t wrap my head around it. Looks like the multinomial version comes from #4169

I’m going to move this discussion back to the PR thread on GH so I can get feedback on that.

@bsmith89 That is great news. Feel free to push all the changes to the fork in the PR. I am sorry to be acting just as a middleman; when I reopened the PR I thought you had lost interest in it :b But it is absolutely great to see the progress you are making! I will try to at least review the changes you made.

I did some research on the Multinomial mode issue in the GH repo. Indeed it seems the goal was to just have a good enough approximation for the starting point during sampling. I agree that users should be warned that it can be inaccurate, but I have no idea how this could be achieved nicely. A correct iterative algorithm sounds impractical given that it would need to support all possible batches/shapes and would have to resort to a contrived theano.scan somewhere. In any case, for our purposes the approximation should be equally good for the DM.
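
To make the point concrete, here is a rough NumPy/SciPy illustration of the approximation (not the actual PyMC3 code, which does the rounding symbolically in Theano): round n * p, patch up the sum, and on a small example you can brute-force the true mode to see that the two can disagree.

```python
import numpy as np
from itertools import product
from scipy.stats import multinomial

# Rough illustration (not the actual PyMC3 code): approximate the Multinomial
# mode by rounding n * p and dumping the rounding error on the largest
# category so the counts still sum to n.
def approx_mode(n, p):
    p = np.asarray(p)
    mode = np.round(n * p).astype(int)
    mode[np.argmax(p)] += n - mode.sum()
    return mode

n, p = 5, np.array([0.3, 0.3, 0.4])
approx = approx_mode(n, p)

# Brute-force the true mode on this tiny example to show the approximation
# can disagree with it.
support = [x for x in product(range(n + 1), repeat=len(p)) if sum(x) == n]
true_mode = max(support, key=lambda x: multinomial.pmf(x, n, p))
print(approx, true_mode)  # here: [2 2 1] vs. a true mode of (1, 2, 2)
```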

Nothing to feel bad about! I had gotten bogged down with all the shape issues the first time around, and didn’t understand most of the Multinomial code/tests, so I didn’t feel like I could get it into shape to be merged.

It’s been very satisfying to push through those issues this time (even if it’s mostly just by copying over the Multinomial implementation) :slight_smile:


For the example notebook, I think I agree with one of your initial suggestions of doing something related to topic modelling. Maybe something along the lines of the Wikipedia integrated example?

Ooof! That Wikipedia page is challenging before 9am.

I’m liking the STAN tutorial on LDA a lot. Any ideas about where we can source “real” example data?

It’s also notable that there don’t seem to be any other LDA or topic-modeling notebooks in the example zoo.

Here maybe: https://github.com/stan-dev/example-models/tree/master/misc/cluster/lda ?

Cool, will check it out.

I’m also seeing the example data in this Stan Discourse post as well as this scikit-learn function.

What is there not to like about a wall of equations? :stuck_out_tongue:


The newspaper data sounds cool. I think the LDA will provide not only an interesting demonstration of the DM dist but also a nice pymc notebook in general.

I assume you mean the Associated Press data? (Which seems to be something of a standard example; it’s in Blei’s original paper…is it older than that?)

Any opinions on the best way to import example data? I figure there are a whole bunch of options:

  • Pull it from Blei’s website (might break in the future)
  • Pull it from an internet archive version. (Ugly)
  • Load it from the ‘topicmodels’ R-package (what most people do in R examples, but …would we use rpy2 or something?)
  • Use a similar dataset available in some Python package, e.g. the sklearn newsgroups example from above (new package dependency?)
  • Store a copy of the data in the examples repo (edit: <-------this seems like what all the other examples do)
  • TBD

Agreed! …But it’ll also be a good bit of work. I might start the process this week, but I’m not sure I’ll be able to push this example all the way through to completion any time soon.

I was referring to the Scipy function you mentioned: https://scikit-learn.org/0.16/modules/generated/sklearn.datasets.fetch_20newsgroups_vectorized.html#sklearn.datasets.fetch_20newsgroups_vectorized

Can’t we use that directly to get some nicely formatted data?

So with the new library dependency? Or do we archive the data to the repo?

To add to the complexity, it’s probably worth mentioning that these are high-dimensional models, and I think the posterior is symmetric under permutation of the mixture components (label switching)…NUTS may not be the right tool for this job.

If I understand correctly that the scipy function downloads the data for you, I think it is fine to just have the code needed to download the data and not store it in the repo. For large data at least this makes more sense. If it is something smaller like the STAN example then we can just add it to GH.

Yeah, I wouldn’t be shocked if issues like that cropped up :slight_smile:. I will have some time next week to try something related to this. Do you have anything already worked out (or do you plan to have something soon) that I could build upon?

Other than that, I think the main PR should be ready for review soon. If we decide to enforce the a and n dimensions, we can drop the batch test and just add a simple test checking that the dimensions are being properly enforced. Otherwise, I think I finally figured out how to translate the part of the batch test that is still failing.
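
Just so we mean the same thing by enforcing the a and n dimensions, this is the kind of usage I picture (a sketch assuming the n/a/shape parametrization from the PR, with shape passed explicitly and its last axis matching the length of a):

```python
import numpy as np
import pymc3 as pm

counts = np.array([[4, 3, 3],
                   [6, 2, 2]])   # two observations over three categories
n = counts.sum(axis=-1)          # total count per observation

with pm.Model():
    a = pm.Gamma("a", alpha=1.0, beta=1.0, shape=counts.shape[-1])
    # shape is explicit; its last axis must match the length of `a`,
    # and the leading axes must broadcast with `n`
    x = pm.DirichletMultinomial(
        "x", n=n, a=a, shape=counts.shape, observed=counts
    )
```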

What do you think? Is something else still missing?

Yes, I think the function will download the data for you, but no, it’s not SciPy. That example was Scikit-Learn, which I don’t think is an existing requirement for PyMC3.
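
For reference, getting a document-term count matrix out of it only takes a few lines (a sketch assuming scikit-learn as a notebook-only dependency; this uses the plain fetch_20newsgroups text loader plus a CountVectorizer rather than the vectorized loader I linked, but it’s the same idea):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Grab a couple of newsgroups and build a (documents x words) count matrix,
# which is exactly the kind of data a DirichletMultinomial/LDA example needs.
newsgroups = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.sport.baseball"],  # keep the example small
    remove=("headers", "footers", "quotes"),
)
vectorizer = CountVectorizer(max_features=500, stop_words="english")
counts = vectorizer.fit_transform(newsgroups.data).toarray()
print(counts.shape)  # (n_documents, n_words)
```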

Any idea what the policy is on using new packages in example notebooks?

I’ve been messing with an LDA model and have had some promising results but not really much for you to build off of yet. If it pans out I’ll upload something before next week, but I don’t think you should expect much.

Agreed! Looking forward to having it merged!

Ahh now I got it. Yeah we can ask the devs if we end up using it in a NB

Very excited that this is almost ready to be merged!

I’ve been trying to keep track of some additional PRs that were proposed while we did this work:

  • Assert shape constraints in the Multinomial the same way we do for the DM (see the sketch after this list).
  • Align the multinomial tests with the evolved DM tests.
  • Deprecate the mode attribute in the Multinomial, using _defaultval instead (is this a problem for any other distributions?)
  • For both DirichletMultinomial and Dirichlet distributions, add alpha parameter name and deprecate a for consistency with (most?) other distributions
  • Example notebook
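
For the first item, the kind of check I have in mind looks roughly like this (the check_last_axis helper is invented here for illustration, not actual PyMC3 code):

```python
import numpy as np

def check_last_axis(shape, p):
    """Hypothetical helper: the declared shape's last axis must match the
    number of categories in the probability/concentration vector."""
    p = np.asarray(p)
    if shape[-1] != p.shape[-1]:
        raise ValueError(
            f"Last axis of shape {shape} must equal the number of "
            f"categories ({p.shape[-1]})."
        )
```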

Anything else?

Does the first one also include having shape be a required argument?

I think you mentioned everything that I had in mind. Do you want to take care of that PR?

I think the a to alpha is going to be tough though…

Just noticed this relevant example linked in another thread on this Discourse: Bayesian Topic Modeling


Does the first one also include having shape be a required argument?

My intuition is just that we should keep the Multinomial and DM aligned in as many ways as possible (init, random, testing, etc.). I haven’t fully grokked exactly how much automatic shape inference we can do for the DirichletMultinomial, but if we’re pretty confident that shape should be required for the DM, it seems like it should be required for the Multinomial, too. I think you may be more up to speed on this than I am, @ricardoV94.

I’ve been trying to keep track of some additional PRs that were proposed while we did this work:

Few more to add to this list based on additional discussion in the PR:

  • Clean up Multinomial._random to e.g. drop the dead raw_size kwarg, as suggested by @Sayam753
  • Dogfood PyMC3’s Dirichlet.random (and Multinomial.random?) method in DirichletMultinomial._random as suggested by @Sayam753
  • Update <Distribution>.random docstrings to explain how the point= argument works as suggested by @AlexAndorra