Hi, I am new to PyMC3 and I am working on sentiment analysis with a topic model that has more nested latent variables and more hierarchies than LDA. I want to use Gibbs sampling in PyMC3 to infer the latent word distributions for both sentiment words and non-sentiment words.
In my model, each document consists of a sequence of tokens (words). Each token has a latent sentiment and a latent category label. The category label indicates whether the word is a sentiment word or a background word. The actual words are drawn from different distributions according to their latent category labels.
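To make the generative story concrete, here is a rough NumPy simulation of how I think of a single review being generated (just an illustrative sketch of my own, not part of the model code; the first token's category is uniform, matching the PyMC3 code below):

import numpy as np

rng = np.random.default_rng(0)

def simulate_review(n, theta_d, pi, phi_s, phi_c):
    """Draw n tokens of one review under the generative process above."""
    words, c_prev = [], None
    for _ in range(n):
        s = rng.choice(len(theta_d), p=theta_d)   # latent sentiment of the token
        if c_prev is None:                        # first token: uniform category
            c = rng.integers(len(pi))
        else:                                     # later tokens: Markov transition
            c = rng.choice(len(pi), p=pi[c_prev])
        if c == 1:                                # sentiment word
            words.append(rng.choice(len(phi_s[s]), p=phi_s[s]))
        else:                                     # background word
            words.append(rng.choice(len(phi_c[c]), p=phi_c[c]))
        c_prev = c
    return words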
I have two general questions. The first is that I get the following error when running the code (see below).
Exception: ('Compilation failed (return status=1):
/Users/wang/.theano/compiledir_Darwin-18.6.0-x86_64-i386-64bit-i386-3.7.3-64/tmpb20_3emw/mod.cpp:37020:32: fatal error: bracket nesting level exceeded maximum of 256
        if (!PyErr_Occurred()) {
                               ^
/Users/wang/.theano/compiledir_Darwin-18.6.0-x86_64-i386-64bit-i386-3.7.3-64/tmpb20_3emw/mod.cpp:37020:32: note: use -fbracket-depth=N to increase maximum nesting level
1 error generated.', )
Code:
### Model specification
import logging

import numpy as np
import pymc3 as pm

logging.info("Starting model")
with pm.Model() as model:
    # Priors: category transition matrix and the two sets of word distributions
    pi = pm.Dirichlet("pi", a=gamma, shape=(C, C))
    phi_s = pm.Dirichlet("phi_s", a=beta_s, shape=(S, V))       # sentiment words
    phi_c = pm.Dirichlet("phi_c", a=delta, shape=(C - 1, V))    # background words
    logging.info("processed priors")
    theta_s = pm.Dirichlet("theta_s", a=alpha_s, shape=(D, S))  # per-review sentiment distribution
    logging.info("processed review level distribution")
    ## For each review
    for d, review in enumerate(X_id):
        # Batch logging every 10 documents
        if d % 10 == 0:
            logging.info("processing review {}".format(d))
        # Latent sentiment of every token in this review
        s_d = pm.Categorical(f"s_{d}", p=theta_s[d], shape=len(review))
        ## For each word in the review
        c_pre = None  # category of the previous word (in this review)
        for w, word in enumerate(review):
            s = s_d[w]
            # the first word has no transition, so its category is uniform
            if c_pre is None:
                c = pm.Categorical(f"c_{d}_{w}", p=np.ones(C) / C)
            else:
                c = pm.Categorical(f"c_{d}_{w}", p=pi[c_pre])
            # c is symbolic, so the emission row has to be selected inside the
            # graph (a Python `if c == 1` would not branch on its sampled value);
            # with C = 2 the only background row is phi_c[0]
            p_word = pm.math.switch(pm.math.eq(c, 1), phi_s[s], phi_c[0])
            ww = pm.Categorical(f"ww_{d}_{w}", p=p_word, observed=word)
            c_pre = c

logging.info("sampling begins")
with model:
    trace = pm.sample(draws=10, tune=1, chains=1, nuts_kwargs={'target_accept': 0.9})
logging.info("sampling completes")
The second question is how to improve the model specification inside the with pm.Model() as model: block. I know I am abusing Python for loops in there, but I don't see how to avoid them, even after reading this tutorial. I did try moving the pm.Categorical() and pm.Dirichlet() calls out of the loops while keeping the same algorithm; that made the model compile faster, but the same error remains. I don't know what else I can do about the token generation, because each token's word distribution is conditioned on its latent category label (see the sketch below).
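For reference, this is the direction I was trying (a rough, untested sketch). The category chain seems to need one random variable per token because of the first-order transition (short of using theano.scan), but the sentiments of a review can be drawn as one vector, and the per-word emissions can collapse into a single observed Categorical with one probability row per token. I am assuming here that pm.Categorical accepts a row-per-token probability matrix; the names n, cs, c_d, and p_rows are mine:

import theano.tensor as tt

with pm.Model() as model:
    pi = pm.Dirichlet("pi", a=gamma, shape=(C, C))
    phi_s = pm.Dirichlet("phi_s", a=beta_s, shape=(S, V))
    phi_c = pm.Dirichlet("phi_c", a=delta, shape=(C - 1, V))
    theta_s = pm.Dirichlet("theta_s", a=alpha_s, shape=(D, S))

    for d, review in enumerate(X_id):
        n = len(review)
        s_d = pm.Categorical(f"s_{d}", p=theta_s[d], shape=n)  # all sentiments at once
        # Category chain: still one RV per token because c_w depends on c_{w-1}
        cs = [pm.Categorical(f"c_{d}_0", p=np.ones(C) / C)]
        for w in range(1, n):
            cs.append(pm.Categorical(f"c_{d}_{w}", p=pi[cs[-1]]))
        c_d = tt.stack(cs)  # vector of categories, shape (n,)
        # Row w is phi_s[s_w] for sentiment words (c_w == 1), else phi_c[0]
        p_rows = tt.switch(tt.eq(c_d, 1)[:, None], phi_s[s_d], phi_c[0][None, :])
        pm.Categorical(f"w_{d}", p=p_rows, observed=np.asarray(review))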
Attached parameter definitions:
### Global parameters
# number of documents
D = len(X_id)
logging.info("Setting D documents: {0}".format(D))
# number of unique words
V = len(nlp.vocab)
logging.info("Setting V Vocab: {0}".format(V))
# number of sentiments
S = 2
logging.info("Setting S sentiments: {0}".format(S))
# number of word categories
C = 2
logging.info("Setting C word categories: {0}".format(C))
### Hyperparameters
alpha_s = 50 * np.ones(S) / S  # prior for per-review sentiment distributions
beta_s = 0.1 * np.ones(V)      # prior for the sentiment-word distributions
delta = 0.1 * np.ones(V)       # prior for the background-word distributions
gamma = 0.1 * np.ones(C)       # prior for the rows of the category transition matrix