I have a model that I would like to use with ~200,000 different but structurally identical data sets (measurements of ~200,000 different stars). I would like to run sampling on all of these objects, so to have this finish in a reasonable amount of time I'm trying to run it on our compute cluster over ~10 nodes (each with 40 cores). The way I've structured my code, I construct the model and read in the data on the main process, then use MPI to send batches of data plus the model to each worker process. On each worker, I then iterate through the batch of data and use `pm.set_data(...)` to set the few star-specific observables before sampling. But this isn't working for me, because a large fraction (but not all) of the workers die when Theano tries to use `subprocess.Popen` to compile the model with the new data (I think):
```
Traceback (most recent call last):
  File "/mnt/ceph/users/apricewhelan/projects/schwimmbad/schwimmbad/mpi.py", line 81, in __init__
    self.wait()
  File "/mnt/ceph/users/apricewhelan/projects/schwimmbad/schwimmbad/mpi.py", line 135, in wait
    result = func(arg)
  File "run-mixmodel.py", line 40, in worker
    model = helper.get_model(**model_kw)
  File "/mnt/ceph/users/apricewhelan/projects/cuddly-system/scripts/model.py", line 124, in get_model
    M = pm.Data('M', np.eye(3))
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/pymc3/data.py", line 547, in __new__
    shared_object.dshape = tuple(shared_object.shape.eval())
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/tensor/var.py", line 287, in <lambda>
    shape = property(lambda self: theano.tensor.basic.shape(self))
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/op.py", line 670, in __call__
    no_recycling=)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/op.py", line 955, in make_thunk
    no_recycling)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/op.py", line 858, in make_c_thunk
    output_storage=node_output_storage)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cc.py", line 1217, in make_thunk
    keep_lock=keep_lock)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cc.py", line 1157, in __compile__
    keep_lock=keep_lock)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cc.py", line 1624, in cthunk_factory
    key=key, lnk=self, keep_lock=keep_lock)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cmodule.py", line 1189, in module_from_key
    module = lnk.compile_cmodule(location)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cc.py", line 1527, in compile_cmodule
    preargs=preargs)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4
Problem occurred during compilation with the command line below:
/cm/shared/sw/pkg/devel/gcc/7.4.0/bin/g++ -shared -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -march=broadwell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=35840 -mtune=broadwell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -I/cm/shared/sw/pkg/devel/python3/3.7.3/lib/python3.7/site-packages/numpy/core/include -I/cm/shared/sw/pkg/devel/python3/3.7.3/include/python3.7m -I/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/c_code -L/cm/shared/sw/pkg/devel/python3/3.7.3/lib -fvisibility=hidden -o /mnt/home/apricewhelan/.theano/1714928/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.8.2003-Core-x86_64-3.7.3-64/tmp6miwlfs0/mf0c597995cda5a5ddd6f2023fd9404c7f21824ed4f7f0d5318dee29bd7fd2b7c.so /mnt/home/apricewhelan/.theano/1714928/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.8.2003-Core-x86_64-3.7.3-64/tmp6miwlfs0/mod.cpp -lpython3.7m
ERROR (theano.gof.cmodule): [Errno 14] Bad address: '/cm/shared/sw/pkg/devel/gcc/7.4.0/bin/g++'
```
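(For reference, the batching itself is the simple part. A minimal pure-Python sketch of how stars get split across workers — the ID-only representation and the worker count are illustrative; the real code ships data arrays, not just IDs:)

```python
def make_batches(star_ids, n_workers):
    """Round-robin the full list of star IDs into one batch per worker.

    Each worker then loops over its batch, calling pm.set_data(...)
    once per star before sampling. star_ids and n_workers are
    placeholders for illustration.
    """
    batches = [[] for _ in range(n_workers)]
    for i, star_id in enumerate(star_ids):
        batches[i % n_workers].append(star_id)
    return batches

# ~200,000 stars over ~10 nodes x 40 cores = 400 workers
batches = make_batches(list(range(200_000)), 400)
```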
It has been extremely frustrating and nearly impossible to debug, because it only happens on some workers, and for some reason it seems to go away for an indeterminate amount of time if I use a fresh conda installation of Python and dependencies instead of the cluster-installed versions. But after a few test runs, it starts to fail with the same error even in that environment. Neither I nor our HPC staff can figure out what is causing this, and I haven't found anyone else reporting similar issues, so it must be something about the way our cluster is configured(?).
All of that to say: since I'm stumped on the underlying problem, I want to find a workaround that gets this code running. One thing I've been trying to figure out is whether there is a way to precompile everything on the main process before pickling and sending the model out to each worker process (it seems perfectly happy to compile on the main process). Is this something that would be doable with pymc3? That is, can I precompile the `dlogp` functions and tell `pm.sample()` to use the precompiled versions? Or does it have to recompile each time `pm.set_data()` is called?
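(One related knob I know of, in case it's useful context: Theano's `base_compiledir` config flag can point each MPI rank at its own compile cache. A minimal sketch below — the `OMPI_COMM_WORLD_RANK`/`SLURM_PROCID` environment variables are assumptions about the launcher, and I'm not claiming this addresses the `Errno 14` itself:)

```python
import os

# Give each MPI rank its own Theano compile cache. base_compiledir is
# a real Theano config flag; the rank variables below are assumptions
# about the launcher (OpenMPI / SLURM). This must run before theano or
# pymc3 is imported anywhere in the process.
rank = os.environ.get("OMPI_COMM_WORLD_RANK",
                      os.environ.get("SLURM_PROCID", "0"))
cachedir = os.path.expanduser(f"~/.theano-rank{rank}")
os.environ["THEANO_FLAGS"] = f"base_compiledir={cachedir}"
```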