Learning the structure of a Bayesian network

(A related question might be this)

I have a corpus of documents, each containing several terms and a few codes. My objective is to estimate the conditional probability P(code|terms). In this question, I describe the problem in more detail, along with a solution using pgmpy. As mentioned there, that solution does not scale well when documents have many terms/codes. Not a big surprise.
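To make the target quantity concrete, here is a minimal sketch of a purely count-based estimate of P(code|terms). It is not the pgmpy solution referred to above; the toy corpus, the term and code values, and the function name `conditional_code_probs` are all hypothetical and only illustrate what is being estimated.

```python
from collections import Counter

# Hypothetical toy corpus: each document is (set of terms, set of codes).
corpus = [
    ({"fever", "cough"}, {"J10"}),
    ({"fever", "rash"}, {"B05"}),
    ({"fever", "cough"}, {"J10", "R05"}),
    ({"cough"}, {"R05"}),
]


def conditional_code_probs(corpus, terms):
    """Empirical P(code | terms): among documents containing all of
    `terms`, the fraction that also carry each code."""
    terms = set(terms)
    # Code sets of the documents whose term set includes every query term.
    matching = [codes for doc_terms, codes in corpus if terms <= doc_terms]
    if not matching:
        return {}  # no document contains all the query terms
    counts = Counter(code for codes in matching for code in codes)
    return {code: n / len(matching) for code, n in counts.items()}


probs = conditional_code_probs(corpus, {"fever", "cough"})
print(probs)
```

With many terms per document, few documents share an exact term combination, so raw counts like these become unreliable. That sparsity is one reason to look for a model-based estimate instead.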

Can I estimate a Bayesian network using PyMC3 in the setting described above? What would you recommend for better scaling? In his comment, @junpenglao mentions observed vs. latent variables. How do they fit into the setting I’m facing?

This sounds similar to latent Dirichlet allocation (https://docs.pymc.io/notebooks/lda-advi-aevb.html), where each document has some topics (terms in your case) and each topic has some keywords (codes in your case).


Thank you very much for this pointer! I’m trying to reproduce the notebook @junpenglao mentioned but I get an error at the 6th cell:

Exception: ("Compilation failed (return status=1): /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:27: error: non-constant-expression cannot be narrowed from type 'npy_intp' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing].     int init_totals[2] = {V3_n0, V3_n1};.                           ^~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:27: note: insert an explicit cast to silence this issue.     int init_totals[2] = {V3_n0, V3_n1};.                           ^~~~~.                           static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:34: error: non-constant-expression cannot be narrowed from type 'npy_intp' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing].     int init_totals[2] = {V3_n0, V3_n1};.                                  ^~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:510:34: note: insert an explicit cast to silence this issue.     int init_totals[2] = {V3_n0, V3_n1};.                                  ^~~~~.                                  static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:9: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing].         V3_stride0, V3_stride1, .         ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:9: note: insert an explicit cast to silence this issue.         V3_stride0, V3_stride1, .         ^~~~~~~~~~.         static_cast<int>( ). 
/Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:21: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing].         V3_stride0, V3_stride1, .                     ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:522:21: note: insert an explicit cast to silence this issue.         V3_stride0, V3_stride1, .                     ^~~~~~~~~~.                     static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:1: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V1_stride0, V1_stride1. ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:1: note: insert an explicit cast to silence this issue. V1_stride0, V1_stride1. ^~~~~~~~~~. static_cast<int>( ). /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:13: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V1_stride0, V1_stride1.             ^~~~~~~~~~. /Users/dror/.theano/compiledir_macOS-10.15.4-x86_64-i386-64bit-i386-3.8.2-64/tmp5kxk23a8/mod.cpp:524:13: note: insert an explicit cast to silence this issue. V1_stride0, V1_stride1.             ^~~~~~~~~~.             static_cast<int>( ). 6 errors generated.. ", '[Elemwise{true_div,no_inplace}(TensorConstant{(128, 10) of 0.1}, <TensorType(float64, (True, True))>)]')

How can I fix it?

NB: It seems like the first line that causes the problem is:

with pm.Model() as model:
    theta = Dirichlet('theta', a=pm.floatX((1.0 / n_topics) * np.ones((minibatch_size, n_topics))), 
                      shape=(minibatch_size, n_topics), transform=t_stick_breaking(1e-9),
                      # do not forget scaling
                      total_size=n_samples_tr)

It seems to be complaining about a data type. Since I cannot reproduce your error locally, could you try casting the arguments passed to shape and total_size to int?

Can you please be more specific?
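As a minimal sketch of what the casting suggestion means: convert the values to plain Python `int` before handing them to `shape` and `total_size`. The values 128 and 10 match the `(128, 10)` constant in the error message; the value of `n_samples_tr` and the helper `as_builtin_int` are hypothetical.

```python
import numpy as np

# Hypothetical values; in the notebook these come from the data preparation.
n_topics = np.int64(10)
minibatch_size = np.int64(128)
n_samples_tr = np.int64(10000)


def as_builtin_int(value):
    """Convert a NumPy integer scalar (or any int-like value) to a plain Python int."""
    return int(value)


shape = (as_builtin_int(minibatch_size), as_builtin_int(n_topics))
total_size = as_builtin_int(n_samples_tr)

# These would then be passed on as Dirichlet(..., shape=shape, total_size=total_size).
print(type(shape[0]).__name__, type(total_size).__name__)
```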

Never mind, they are Python ints to begin with. Hmm, I am not sure why you are getting this error. Maybe try upgrading your PyMC3 and Theano installations?
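Before upgrading, it can help to record which versions are actually installed in the active environment. The following is a small sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the distribution names passed to it are assumptions and may differ depending on how the packages were installed.

```python
import platform
import sys
from importlib import metadata  # standard library from Python 3.8 onwards


def installed_version(package):
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


report = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    # Distribution names below are assumptions about how the packages are published.
    "pymc3": installed_version("pymc3"),
    "theano": installed_version("Theano"),
    "numpy": installed_version("numpy"),
}

for name, version in report.items():
    print(f"{name}: {version}")
```

Posting this output alongside the error makes it easier for others to try to reproduce the problem.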

I believe it is fairly up to date. I am running this in an environment defined by:

name: icd10-infer
channels:
    - conda-forge
    - plotly
dependencies:
    - python=3.8
    - pip
    - pytest
    - ipykernel
    - "pandas<2.0"
    - "matplotlib<4.0"
    - "scikit-learn<0.23"
    - "pymc3<4.0"
    - mkl-service
    - "python-graphviz<0.14"
    - pydot
    - "bidict<0.20"
    - tqdm
    - plotly=4.6.0
    - pip:
        - pgmpy==0.1.9
        - dvc<1.0
        - streamlit

Could it be due to Python 3.8?