Topic modeling in PyMC

Assume I have a data set of, say, CNN's news highlights, or a list of documents numbered 1 to 5.

  1. How can I draw a word distribution for each topic from a V-dimensional symmetric Dirichlet distribution: $\phi_k \sim \mathrm{Dir}(\beta)$, $1 \le k \le K$
  2. How can I draw a topic distribution for each document from a K-dimensional symmetric Dirichlet distribution: $\theta_m \sim \mathrm{Dir}(\alpha)$, $1 \le m \le M$
  3. For each word in the document, draw a topic
    according to a Multinomial (Categorical) distribution: $z_{m,n} \sim \mathrm{Multinomial}(\theta_m)$, $1 \le m \le M$, $1 \le n \le N_m$
  4. Draw a physical word using $w_{m,n} \sim \mathrm{Multinomial}(\phi_{z_{m,n}})$, $1 \le m \le M$, $1 \le n \le N_m$
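
To make the four steps concrete, here is a minimal forward-simulation sketch in plain NumPy (not PyMC). The sizes `K`, `V`, `M`, the document lengths, and the concentration values `alpha` and `beta` are all made-up illustrative assumptions, not values from my data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): topics, vocabulary size, documents
K, V, M = 3, 8, 5
alpha, beta = 0.5, 0.5            # symmetric Dirichlet concentrations (assumed)
doc_lengths = [6, 4, 7, 5, 6]     # N_m for each document (assumed)

# Step 1: one word distribution per topic, phi_k ~ Dir(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)

# Step 2: one topic distribution per document, theta_m ~ Dir(alpha)
theta = rng.dirichlet(np.full(K, alpha), size=M)   # shape (M, K)

z, w = [], []
for m, n_m in enumerate(doc_lengths):
    # Step 3: a topic for every word position, z_{m,n} ~ Categorical(theta_m)
    z_m = rng.choice(K, size=n_m, p=theta[m])
    # Step 4: a word id from the assigned topic, w_{m,n} ~ Categorical(phi_{z_{m,n}})
    w_m = np.array([rng.choice(V, p=phi[k]) for k in z_m])
    z.append(z_m)
    w.append(w_m)
```

Here `w` is a list of integer word-id arrays, one per document, and `z` holds the matching topic assignments. In the inference problem only `w` is observed; `phi`, `theta`, and `z` are the unknowns.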

How do I build the observed variable $w_{m,n}$, $1 \le m \le M$, $1 \le n \le N_m$, and infer the hidden topic structure:
$\theta_m$, $1 \le m \le M$
$\phi_k$, $1 \le k \le K$

I would also like to trace $z_{m,n}$, $1 \le m \le M$, $1 \le n \le N_m$.
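
One possible way to prepare the observed data, sketched in plain Python: map every token to an integer word id and flatten the corpus into two parallel lists, one with the word id of each token and one with the index of the document it came from. The toy corpus and the names `word_idx` and `doc_idx` are hypothetical, chosen only for illustration.

```python
# Hypothetical toy corpus standing in for the real documents
docs = [
    "markets rally as rates hold",
    "storm closes airports as flights halt",
    "rates rise and markets fall",
]

# Naive whitespace tokenization (an assumption; real text needs more care)
tokens = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct word gets an integer id in 0..V-1
vocab = sorted({t for doc in tokens for t in doc})
word_to_id = {word: i for i, word in enumerate(vocab)}

# Flatten the ragged corpus into two parallel lists, one entry per token:
# word_idx[i] is the word id w_{m,n}; doc_idx[i] is the document m it belongs to
word_idx = [word_to_id[t] for doc in tokens for t in doc]
doc_idx = [m for m, doc in enumerate(tokens) for _ in doc]

M, V, N = len(docs), len(vocab), len(word_idx)
```

With this layout, `word_idx` would play the role of the observed $w_{m,n}$, and `doc_idx` picks out which $\theta_m$ applies to each token. The flat form avoids ragged arrays when documents have different lengths $N_m$. How these lists are handed to the model depends on the PyMC version in use.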

Have you taken a look at this notebook? If not, it might be a good place to get started.
