A newbie question: How should I formulate a classification problem using PyMC3?

I have a classification problem.
I have a dataset of 1150 docs, each with a variable number of words (from a vocabulary of 100,000). The docs are labeled favorable / unfavorable.
Additionally, I have 800 docs that are not labeled, and I have to predict their labels.
I read through the example below:
http://docs.pymc.io/notebooks/lda-advi-aevb.html

I thought I should specify the number of topics as two…but in the example the topic is inferred automatically from the words in the document. In my case, at least 1150 documents already have a topic (favorable / unfavorable).

Can someone help me formulate the problem so that pymc3 / lda / aevb, or parts thereof, can be used to solve this classification problem?

My apologies for a rudimentary question.

It depends on your assumptions. If you assume that each document contains some latent topics, and that it is the topics that drive the favourable / unfavourable labelling, you can first fit an LDA model as in the doc and use its output (the inferred topics of each document) as input to a logistic regression.
You can also specify the number of topics as two; that way you are doing a large sparse logistic regression. You can still use a neural net, as in the LDA doc, as the approximation for inference.
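
To make the "large sparse logistic regression" idea a bit more concrete, here is a rough sketch. It is plain Python rather than PyMC3, and it finds a MAP point estimate by gradient ascent instead of running ADVI, so treat it only as an illustration of the model structure: a Normal prior on each word weight and a Bernoulli likelihood on the label. The function names, the toy documents, and the `prior_sd`, `lr` and `epochs` values are all made up for the example.

```python
import math
from collections import Counter


def featurize(doc):
    """Sparse bag-of-words counts for a whitespace-tokenised document."""
    return Counter(doc.lower().split())


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_map(docs, labels, prior_sd=1.0, lr=0.1, epochs=200):
    """Gradient ascent on the log-posterior of a logistic regression.

    Likelihood: label ~ Bernoulli(sigmoid(bias + sum of weight * count)).
    Prior: each word weight ~ Normal(0, prior_sd); the bias is left unpenalised.
    """
    feats = [featurize(d) for d in docs]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        # Gradient of the Normal log-prior with respect to each weight.
        grad = {w: -v / prior_sd ** 2 for w, v in weights.items()}
        grad_bias = 0.0
        for x, y in zip(feats, labels):
            p = sigmoid(bias + sum(weights.get(w, 0.0) * c for w, c in x.items()))
            err = y - p  # gradient of the Bernoulli log-likelihood w.r.t. the logit
            grad_bias += err
            for w, c in x.items():
                grad[w] = grad.get(w, 0.0) + err * c
        for w, g in grad.items():
            weights[w] = weights.get(w, 0.0) + lr * g
        bias += lr * grad_bias
    return weights, bias


def predict_proba(doc, weights, bias):
    """Probability of the positive (favorable) class; unseen words get weight 0."""
    x = featurize(doc)
    return sigmoid(bias + sum(weights.get(w, 0.0) * c for w, c in x.items()))


# Toy stand-ins for the labeled docs (1 = favorable, 0 = unfavorable).
docs = [
    "great product works well",
    "terrible quality broke fast",
    "works great love it",
    "broke terrible waste",
]
labels = [1, 0, 1, 0]

weights, bias = fit_map(docs, labels)
print(predict_proba("great works", weights, bias))
print(predict_proba("terrible broke", weights, bias))
```

Only words that actually occur in a document are touched when scoring it, which is what keeps this workable with a 100,000-word vocabulary. A full Bayesian version would replace the point estimate with a posterior over the weights.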

I have not seen an example of naive Bayes with PyMC3, but that might also be a good approach here, right? The advantage (over, say, sklearn) would be having posterior estimates for each word. If you did not care about such estimates, then sklearn might be the right way to go.
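
For what "posterior estimates for each word" could look like, here is a minimal sketch of a multinomial naive Bayes with a symmetric Dirichlet prior on each class's word probabilities. Because the Dirichlet is conjugate to the multinomial, the posterior is available in closed form, so this uses plain Python instead of PyMC3 and does no sampling. The function names, the toy documents and `alpha=1.0` are assumptions made for the example, and the prediction step plugs in posterior means rather than integrating over the full posterior.

```python
import math
from collections import Counter


def fit_word_posteriors(docs, labels, vocab, alpha=1.0):
    """Per-class Dirichlet posterior parameters over word probabilities.

    With a Dirichlet(alpha) prior and multinomial word counts, the posterior
    for each class is Dirichlet(alpha + counts).
    """
    post = {}
    for c in set(labels):
        counts = Counter()
        for d, y in zip(docs, labels):
            if y == c:
                counts.update(w for w in d.lower().split() if w in vocab)
        post[c] = {w: alpha + counts[w] for w in vocab}
    return post


def posterior_mean_and_sd(post_c, word):
    """Mean and standard deviation of one word's probability in one class.

    The marginal of a Dirichlet component is Beta(a, a0 - a).
    """
    a = post_c[word]
    a0 = sum(post_c.values())
    mean = a / a0
    var = a * (a0 - a) / (a0 ** 2 * (a0 + 1))
    return mean, math.sqrt(var)


def predict_proba(doc, post, labels):
    """Class probabilities using posterior-mean word probabilities (a plug-in estimate)."""
    n = Counter(labels)
    log_scores = {}
    for c, alphas in post.items():
        a0 = sum(alphas.values())
        # Smoothed class frequency as the class prior.
        score = math.log((n[c] + 1) / (len(labels) + len(post)))
        for w in doc.lower().split():
            if w in alphas:  # words outside the vocabulary are skipped
                score += math.log(alphas[w] / a0)
        log_scores[c] = score
    m = max(log_scores.values())
    z = sum(math.exp(s - m) for s in log_scores.values())
    return {c: math.exp(s - m) / z for c, s in log_scores.items()}


# Toy stand-ins for the labeled docs.
docs = [
    "great product works well",
    "terrible quality broke fast",
    "works great love it",
    "broke terrible waste",
]
labels = ["favorable", "unfavorable", "favorable", "unfavorable"]
vocab = {w for d in docs for w in d.lower().split()}

post = fit_word_posteriors(docs, labels, vocab)
probs = predict_proba("great works", post, labels)
print(posterior_mean_and_sd(post["favorable"], "great"))
print(probs)
```

The per-word standard deviation is the part a point-estimate classifier would not give you: words seen only a handful of times keep a wide posterior relative to their mean.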

@junpenglao & @colcarroll

Thank you both !

Here’s an older example; it requires some translation to PyMC3.

@mkesin,
Thank you :pray: