(A related question might be this)
I have a corpus of documents, each containing several terms and a few codes. My objective is to estimate the conditional probability P(code|terms). In this question, I describe the problem in more detail, along with a solution using pgmpy. As mentioned in the question, this solution does not scale well when there are many terms/codes per document. Not a big surprise.
Can I estimate a Bayesian network using PyMC3 in the setting described above? What would you recommend for better scaling? In his comment, @junpenglao mentions observed vs. latent variables. How do they fit into the setting I’m facing?
This sounds similar to latent Dirichlet allocation (https://docs.pymc.io/notebooks/lda-advi-aevb.html), where each document has some topics (terms in your case) and each topic has some keywords (codes in your case).
Thank you very much for this pointer! I’m trying to reproduce the notebook @junpenglao mentioned but I get an error at the 6th cell:
Exception: ("Compilation failed (return status=1): /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:27: error: non-constant-expression cannot be narrowed from type 'npy_intp' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:27: note: insert an explicit cast to silence this issue. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:34: error: non-constant-expression cannot be narrowed from type 'npy_intp' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:34: note: insert an explicit cast to silence this issue. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:9: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V3_stride0, V3_stride1, . ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:9: note: insert an explicit cast to silence this issue. V3_stride0, V3_stride1, . ^~~~~~~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:21: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V3_stride0, V3_stride1, . ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:21: note: insert an explicit cast to silence this issue. 
V3_stride0, V3_stride1, . ^~~~~~~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:1: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V1_stride0, V1_stride1. ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:1: note: insert an explicit cast to silence this issue. V1_stride0, V1_stride1. ^~~~~~~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:13: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V1_stride0, V1_stride1. ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:13: note: insert an explicit cast to silence this issue. V1_stride0, V1_stride1. ^~~~~~~~~~. static_cast<int>( ). 6 errors generated.. ", '[Elemwise{true_div,no_inplace}(TensorConstant{(128, 10) of 0.1}, <TensorType(float64, (True, True))>)]')
How can I fix it?
NB: It seems like the first line that causes the problem is:
with pm.Model() as model:
    theta = Dirichlet('theta', a=pm.floatX((1.0 / n_topics) * np.ones((minibatch_size, n_topics))),
                      shape=(minibatch_size, n_topics), transform=t_stick_breaking(1e-9),
                      # do not forget scaling
                      total_size=n_samples_tr)
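The compiler messages above come from clang treating `-Wc++11-narrowing` as an error in the C++ code Theano generates. One workaround that may be worth trying is to pass clang the matching `-Wno-c++11-narrowing` flag through Theano's `gcc.cxxflags` option before Theano is imported. The sketch below assumes Theano 1.0.x, where the option is named `gcc.cxxflags` (later Theano-PyMC releases renamed it to `gcc__cxxflags`). The helper `with_cxxflag` is a hypothetical name used only for illustration, and whether the flag resolves this particular setup is untested here.

```python
import os

# Assumption: clang accepts this flag to downgrade the narrowing error.
NARROWING_FLAG = "-Wno-c++11-narrowing"


def with_cxxflag(existing, flag=NARROWING_FLAG, key="gcc.cxxflags"):
    """Return a THEANO_FLAGS string that also sets `key` to `flag`.

    `existing` is the current comma-separated THEANO_FLAGS value.
    If `key` is already set there, the string is returned unchanged.
    """
    parts = [p for p in existing.split(",") if p]
    if any(p.startswith(key + "=") for p in parts):
        return existing  # respect a setting the user already chose
    return ",".join(parts + [f"{key}={flag}"])


# Must run before `import theano` / `import pymc3`, because Theano
# reads THEANO_FLAGS only once, at import time.
os.environ["THEANO_FLAGS"] = with_cxxflag(os.environ.get("THEANO_FLAGS", ""))
print(os.environ["THEANO_FLAGS"])
```

In a notebook, this would go in the very first cell, followed by a kernel restart so that Theano picks up the new flags.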
It seems to be complaining about a data type. Since I cannot reproduce your error locally, could you try casting the arguments passed to shape and total_size to int?
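To make the suggestion concrete, here is a minimal sketch of forcing those arguments to built-in Python ints before they are handed to the model. The values 128 and 10 are taken from the `(128, 10)` constant in the error message; the value for `n_samples_tr` is a placeholder, and `as_py_int` is a hypothetical helper, not part of PyMC3.

```python
import operator


def as_py_int(value):
    """Convert an integer-like value (e.g. a NumPy integer) to a built-in int.

    operator.index raises TypeError for floats, so a non-integral value
    is caught here instead of being silently truncated.
    """
    return operator.index(value)


minibatch_size = 128   # from the (128, 10) constant in the error message
n_topics = 10          # from the (128, 10) constant in the error message
n_samples_tr = 10000   # placeholder; use the size of your training set

shape = (as_py_int(minibatch_size), as_py_int(n_topics))
total_size = as_py_int(n_samples_tr)

# Confirm that every value is a plain Python int.
print([type(v).__name__ for v in (*shape, total_size)])
```

The resulting `shape` and `total_size` would then be passed to `Dirichlet(...)` in place of the original expressions. If they were already built-in ints, this changes nothing, which would rule out the argument types as the cause.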
Can you please be more specific?
Never mind, they are Python ints to begin with. Hmm, I am not sure why you are getting this error. Maybe try upgrading your pymc3 and theano installation?
I believe it is fairly up to date. I am running this in an environment defined by:
name: icd10-infer
channels:
  - conda-forge
  - plotly
dependencies:
  - python=3.8
  - pip
  - pytest
  - ipykernel
  - "pandas<2.0"
  - "matplotlib<4.0"
  - "scikit-learn<0.23"
  - "pymc3<4.0"
  - mkl-service
  - "python-graphviz<0.14"
  - pydot
  - "bidict<0.20"
  - tqdm
  - plotly=4.6.0
  - pip:
    - dvc<1.0
    - pgmpy==0.1.9
    - streamlit
Could it be due to Python 3.8?
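Since the environment file only gives upper bounds such as "pymc3<4.0", it does not say which versions were actually installed. A small sketch for reporting them, using only the standard library (`importlib.metadata` is available from Python 3.8), could look like this. The package list is an assumption about what is relevant here, and `installed_version` is a hypothetical helper name.

```python
import platform
from importlib import metadata


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# Collect the interpreter version plus the packages most likely involved.
report = {"python": platform.python_version()}
for pkg in ("pymc3", "theano", "numpy"):
    report[pkg] = installed_version(pkg)

for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Posting this output together with the result of `clang --version` would make it easier for others to tell whether the Python version, the Theano version, or the compiler is the likely culprit.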