Questions from replicating Latent Dirichlet Allocation work

I have been doing some LDA work (Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.) and have been getting odd results, so I am trying to replicate it using PyMC3 and MCMC (because I don’t understand variational inference well – it seems like magic to me).

Since I was getting such odd results, I decided to replicate the model for the AP articles that is available in the blei-lab/lda-c repository on GitHub (a C implementation of variational EM for latent Dirichlet allocation, a topic model for text or other discrete data).

For that, I developed the following PyMC3 model:

import numpy as np
import pymc3 as pm

with pm.Model() as model:
    # alpha = pm.Data("α", np.ones(K))
    alpha = np.ones(K)                               # symmetric prior for the per-document topic mixtures
    # beta_prior = pm.Data("β0", np.ones((1, V)))
    beta_prior = np.ones((1, V))                     # symmetric prior for the per-topic word distributions
    doc_num = pm.Data('i', df['Document'])           # document index for each observed token
    theta = pm.Dirichlet("θ", a=alpha, shape=(D, K))          # per-document topic probabilities
    beta = pm.Dirichlet("beta", a=beta_prior, shape=(K, V))   # per-topic word probabilities
    comps = pm.Categorical.dist(p=beta, shape=(K, V))         # one categorical component per topic
    w = pm.Mixture("w", w=theta[doc_num], comp_dists=comps, observed=df['Word'])
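
Here K is the number of topics, V the vocabulary size, D the number of documents, and df holds the corpus in long format, one row per word token. A toy stand-in with the same structure (the real corpus has a roughly 10K-word vocabulary and 389,701 tokens; these sizes are just placeholders):

import pandas as pd

K, V, D, N = 5, 50, 10, 2_000                # toy sizes, just to show the shapes
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Document': rng.integers(0, D, size=N),  # integer document index per token
    'Word': rng.integers(0, V, size=N),      # integer vocabulary index per token
})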

[I tried building a simpler model without using `pm.Mixture`, but it blew the stack even on a very beefy server with the stack size expanded by a factor of 4 – a sketch of roughly what I tried is below.]
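
That simpler formulation was roughly the following (a reconstruction, not the exact code): one latent topic assignment per token, which means nearly 390K discrete latent variables.

with pm.Model() as explicit_model:
    theta = pm.Dirichlet("θ", a=np.ones(K), shape=(D, K))
    beta = pm.Dirichlet("beta", a=np.ones((1, V)), shape=(K, V))
    # one latent topic assignment per observed token
    z = pm.Categorical("z", p=theta[df['Document'].values], shape=len(df))
    # each token is drawn from its assigned topic's word distribution
    w = pm.Categorical("w", p=beta[z], observed=df['Word'].values)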

This model is excruciatingly slow to sample, and gives me a bunch of divergences to boot.

I was hoping for some general advice/answers:

  1. The vocabulary is relatively large (approximately 10K entries, so the categorical distribution is 10K wide) and the number of observations is also large (389,701). Is this simply more than I should expect PyMC3 to handle?

  2. Is there something I could do to the way the model is structured to speed up sampling? Is the sheer number of observations making the gradient computation too expensive? One restructuring I have been wondering about is collapsing repeated (document, word) pairs into counts – see the sketch after this list.

  3. Any guidance on choosing the priors for α and β? I can’t say for certain, but I wonder whether the divergences are coming from poor prior choices. By using 1 as the concentration for the p(word|topic) prior, I am claiming that each topic smears its probability mass across the whole vocabulary rather than being clumpy; a quick illustration of the clumpiness I mean is after this list. But I don’t have any intuition about how far I should lower that concentration to get something more reasonable. I don’t see any reason to change the α prior, since there doesn’t seem to be any a priori reason to expect the topics to be distributed unevenly across the articles. I’d welcome suggestions, or pointers to the literature on priors for Dirichlet distributions (especially since this is effectively Google-proof – any query I can come up with just gives me Dirichlets as priors, not priors for Dirichlets!).
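
On question 2, the restructuring I have in mind: the likelihood only depends on how many times each (document, word) pair occurs, so the 389,701 token-level terms could be collapsed into counts over unique pairs, with the marginal mixture log-likelihood written out by hand in a Potential. A sketch (untested, same names as above):

counts = df.groupby(['Document', 'Word']).size().reset_index(name='n')
doc_idx = counts['Document'].values
word_idx = counts['Word'].values
n = counts['n'].values

with pm.Model() as collapsed_model:
    theta = pm.Dirichlet("θ", a=np.ones(K), shape=(D, K))
    beta = pm.Dirichlet("beta", a=np.ones((1, V)), shape=(K, V))
    # log p(word | doc) = logsumexp_k( log θ[doc, k] + log β[k, word] )
    logp_pair = pm.math.logsumexp(
        pm.math.log(theta[doc_idx]) + pm.math.log(beta.T[word_idx]), axis=1
    )
    # weight each unique (document, word) pair's log-likelihood by its count
    pm.Potential("loglike", (n * logp_pair.ravel()).sum())

And on question 3, to get some feel for what a smaller concentration does, raw Dirichlet draws show the clumpiness I mean (a tiny vocabulary just so the printout is readable):

rng = np.random.default_rng(0)
V_small = 10
print(rng.dirichlet(np.ones(V_small)))        # concentration 1: mass smeared fairly evenly
print(rng.dirichlet(0.1 * np.ones(V_small)))  # concentration 0.1: most mass on a handful of words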

Oh, OK, duh, I see: the sampler is saving the full topic × word probability array for every entry in the trace, which comes to about 30 MB per sample.

As a Python programmer, I know the value of everything, and the cost of nothing!

Keeping the full trace in memory is obviously never going to work for something as big as this.

Still curious about the Dirichlet priors, though.