(A related question might be this)
I have a corpus of documents, each containing several terms and a few codes. My objective is to estimate the conditional probability P(code|terms). In this question, I describe the problem in more detail, along with a solution using pgmpy. As mentioned in the question, this solution does not scale well when there are many terms/codes per document. Not a big surprise.
Can I estimate a Bayesian network using PyMC3 in the setting described above? What would you recommend to make it scale better? In his comment, @junpenglao mentions observed vs. latent variables. How do they fit into the setting I’m facing?
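To make the target quantity concrete, here is a minimal sketch of a count-based baseline on a toy corpus. It is not the pgmpy solution and not a Bayesian network: it only estimates P(code|term) for a single term from co-occurrence counts and ignores interactions between terms. The corpus, the term and code values, and the helper name `p_code_given_term` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each document is (set of terms, set of codes).
docs = [
    ({"fever", "cough"}, {"J10"}),
    ({"fever", "rash"}, {"B05"}),
    ({"cough"}, {"J10"}),
]

term_count = Counter()             # number of documents containing each term
pair_count = defaultdict(Counter)  # number of documents containing term AND code
for terms, codes in docs:
    for t in terms:
        term_count[t] += 1
        for c in codes:
            pair_count[t][c] += 1


def p_code_given_term(code, term):
    """Fraction of documents containing `term` that also carry `code`."""
    if term_count[term] == 0:
        return 0.0
    return pair_count[term][code] / term_count[term]


print(p_code_given_term("J10", "cough"))  # 2 of 2 documents -> 1.0
print(p_code_given_term("J10", "fever"))  # 1 of 2 documents -> 0.5
```

Conditioning on a whole set of terms at once is where this simple table stops being enough, which is the part I hope a proper model can handle.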
This sounds similar to latent Dirichlet allocation (https://docs.pymc.io/notebooks/lda-advi-aevb.html), where each document has some topics (terms in your case) and each topic has some keywords (codes in your case).
Thank you very much for this pointer! I’m trying to reproduce the notebook @junpenglao mentioned but I get an error at the 6th cell:
Exception: ("Compilation failed (return status=1): /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:27: error: non-constant-expression cannot be narrowed from type 'npy_intp' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:27: note: insert an explicit cast to silence this issue. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:34: error: non-constant-expression cannot be narrowed from type 'npy_intp' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:34: note: insert an explicit cast to silence this issue. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:9: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V3_stride0, V3_stride1, . ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:9: note: insert an explicit cast to silence this issue. V3_stride0, V3_stride1, . ^~~~~~~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:21: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V3_stride0, V3_stride1, . ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:21: note: insert an explicit cast to silence this issue. 
V3_stride0, V3_stride1, . ^~~~~~~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:1: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V1_stride0, V1_stride1. ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:1: note: insert an explicit cast to silence this issue. V1_stride0, V1_stride1. ^~~~~~~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:13: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V1_stride0, V1_stride1. ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:13: note: insert an explicit cast to silence this issue. V1_stride0, V1_stride1. ^~~~~~~~~~. static_cast<int>( ). 6 errors generated.. ", '[Elemwise{true_div,no_inplace}(TensorConstant{(128, 10) of 0.1}, <TensorType(float64, (True, True))>)]')
How can I fix it?
NB: It seems like the first line that causes the problem is:
with pm.Model() as model:
    theta = Dirichlet('theta', a=pm.floatX((1.0 / n_topics) * np.ones((minibatch_size, n_topics))),
                      shape=(minibatch_size, n_topics), transform=t_stick_breaking(1e-9),
                      # do not forget scaling
                      total_size=n_samples_tr)
It seems to be complaining about a data type. Since I cannot reproduce your error locally, could you try casting the args passed to shape and total_size to int?
Can you please be more specific?
Never mind, they are Python ints to begin with. Hmm, I am not sure why you are getting the error. Maybe try upgrading your pymc3 and theano installation?
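If it helps, here is a small sketch for printing what is actually installed in the active environment, using only the standard library. The helper name `installed_version` is hypothetical, and the list of package names is just my guess at what is relevant here.

```python
import platform
from importlib import metadata  # in the standard library since Python 3.8


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("python", platform.python_version())
for name in ("pymc3", "Theano", "numpy"):
    print(name, installed_version(name))
```

Posting that output would make it easier to compare your setup with one where the notebook runs.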
I believe it is fairly up to date. I am running this in an environment defined by:
name: icd10-infer
channels:
  - conda-forge
  - plotly
dependencies:
  - python=3.8
  - pip
  - pytest
  - ipykernel
  - "pandas<2.0"
  - "matplotlib<4.0"
  - "scikit-learn<0.23"
  - "pymc3<4.0"
  - mkl-service
  - "python-graphviz<0.14"
  - pydot
  - "bidict<0.20"
  - tqdm
  - plotly=4.6.0
  - pip:
    - pgmpy==0.1.9
    - dvc<1.0
    - streamlit
Could it be due to Python 3.8?
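The messages in the exception are clang's -Wc++11-narrowing errors, so one experiment that might separate a compiler issue from a Python-version issue is to pass `-Wno-c++11-narrowing` to the compiler through Theano's `gcc.cxxflags` option. This is only a sketch: I am assuming the flag reaches clang via `THEANO_FLAGS`, and I have not confirmed that it resolves the error above.

```python
import os

# Assumption: Theano reads THEANO_FLAGS when it is first imported, so this
# has to run before `import theano` or `import pymc3` in a fresh kernel.
flag = "gcc.cxxflags=-Wno-c++11-narrowing"
existing = os.environ.get("THEANO_FLAGS", "")
if flag not in existing:
    # THEANO_FLAGS is a comma-separated list of key=value settings.
    os.environ["THEANO_FLAGS"] = ",".join(p for p in (existing, flag) if p)

print(os.environ["THEANO_FLAGS"])
```

If the notebook cell compiles after restarting the kernel with this set, the compiler's handling of narrowing conversions would be the more likely trigger than Python 3.8 itself.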