Hi,
tldr: How would pymc’s inference speed (in Bayesian nets) compare to other libraries such as pgmpy, pomegranate, or the commercial product pySMILE?
I’ve been searching for a replacement for pySMILE, which we use for inference in Bayesian nets (currently around 200 nodes and 300 edges, mostly 2 states per node, max. 4-5). First I managed to make pgmpy work; it uses exact inference and is quite easy to use, however it is slower than our current solution, and even throwing “bigger HW” at it doesn’t help enough.
I also managed to make pomegranate work, however it uses approximate inference, giving slightly “off” results, and it seems to be even slower than pgmpy. (I’m only talking about CPU usage here.)
After these half-failed attempts I tried speeding things up using parallelization on modal, which sort of works but in the long run would cost too much as I see it. By parallelization I don’t mean parallelizing a single inference, but parallelizing this “algorithm”:
- do initial inference for all nodes/variables with 0 evidence (I use node and variable interchangeably)
- iterate over ~40-50-60% of our nodes (let’s call them “observables”; the share depends on the given network)
- for every observable node, set each of its states as evidence in turn (the number of states is usually around 2-3-4)
- calculate inference for the other 40-50-60% of nodes
- calculate a value for the observable node which shows how much effect/impact changing its state has on the rest of the variables (~comparing the original 0-evidence inference to the just-calculated probabilities - this is not a complicated calculation)
This is basically calculating the impact of “observable” variables on other variables.
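In code, the loop looks roughly like this (a minimal sketch: `infer` is a toy stand-in for whatever query call the actual library provides, e.g. pgmpy’s `VariableElimination.query`, and the impact value is computed here as a summed total variation distance, which is just one possible choice):

```python
def infer(evidence):
    # Toy stand-in: returns posterior distributions for two target
    # nodes. A real implementation would query the Bayesian net.
    if evidence.get("A") == 1:
        return {"B": [0.2, 0.8], "C": [0.6, 0.4]}
    return {"B": [0.5, 0.5], "C": [0.5, 0.5]}

def impact_scores(observables, states, targets):
    baseline = infer({})  # initial inference with 0 evidence
    scores = {}
    for obs in observables:
        total = 0.0
        for state in states[obs]:
            posterior = infer({obs: state})  # condition on one state
            # Compare conditioned posteriors to the baseline:
            # total variation distance, summed over target nodes.
            for t in targets:
                total += 0.5 * sum(
                    abs(p - q) for p, q in zip(posterior[t], baseline[t])
                )
        scores[obs] = total
    return scores

print(impact_scores(["A"], {"A": [0, 1]}, ["B", "C"]))
```

Each `infer({obs: state})` call is independent given the baseline, which is why the loop parallelizes so naturally.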
One thing is that this algorithm could most likely be improved a lot (any ideas are welcome!).
After these not-so-promising results I finally turned towards pymc, which seemed like a library that is not exactly beginner-friendly. I managed to make a small example work; however, before actually writing more code (like functions that can properly set up ConditionalProbabilityTables, handle variables with multiple states/outcomes, etc.), I’d like to have more confidence that pymc could be faster than the alternatives.
Any thoughts on this?
I’m hoping that pytensor could speed things up, but then again pomegranate uses pytorch, which should also be fast? On the other hand, pomegranate can only calculate inference for all the nodes in the net at once, which can be quite painful speed-wise when you run hundreds of inferences.
Approximate inference is not necessarily a problem; speed is more important than precise results (within a reasonable threshold, of course). Variational inference is another reason I have faith in pymc (+pytensor), since based on the introduction it seems much faster.
(My initial solution is based on this: https://discourse.pymc.io/t/bayes-nets-belief-networks-and-pymc/5150/2, which doesn’t work out of the box in newer pymc; you have to tweak a few things, such as setting return_inferencedata to False. I’ve seen multiple threads where others also struggled to make it work, btw; maybe this initiative could help: https://github.com/pymc-devs/pymc/discussions/6625 )
Thank you in advance for any replies!