I have a model that I would like to use with ~200,000 different but identically-structured data sets (measurements of ~200,000 different stars). I would like to run sampling on all of these objects, so to have this execute in finite time I'm trying to run it on our compute cluster over ~10 nodes (each with 40 cores). The way I've structured my code, I construct the model and read in the data on the main process, then use MPI to send batches of data plus the model to each worker process. On each worker, I then iterate through the batch and use `pm.set_data(...)` to set the few star-specific observables before sampling. But this isn't working for me: a large fraction (but not all) of the workers die when Theano tries to use `subprocess.Popen` to compile the model with the new data (I think):
```
Traceback (most recent call last):
  File "/mnt/ceph/users/apricewhelan/projects/schwimmbad/schwimmbad/mpi.py", line 81, in __init__
    self.wait()
  File "/mnt/ceph/users/apricewhelan/projects/schwimmbad/schwimmbad/mpi.py", line 135, in wait
    result = func(arg)
  File "run-mixmodel.py", line 40, in worker
    model = helper.get_model(**model_kw)
  File "/mnt/ceph/users/apricewhelan/projects/cuddly-system/scripts/model.py", line 124, in get_model
    M = pm.Data('M', np.eye(3))
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/pymc3/data.py", line 547, in __new__
    shared_object.dshape = tuple(shared_object.shape.eval())
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/tensor/var.py", line 287, in <lambda>
    shape = property(lambda self: theano.tensor.basic.shape(self))
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/op.py", line 670, in __call__
    no_recycling=[])
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/op.py", line 955, in make_thunk
    no_recycling)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/op.py", line 858, in make_c_thunk
    output_storage=node_output_storage)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cc.py", line 1217, in make_thunk
    keep_lock=keep_lock)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cc.py", line 1157, in __compile__
    keep_lock=keep_lock)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cc.py", line 1624, in cthunk_factory
    key=key, lnk=self, keep_lock=keep_lock)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cmodule.py", line 1189, in module_from_key
    module = lnk.compile_cmodule(location)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/cc.py", line 1527, in compile_cmodule
    preargs=preargs)
  File "/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4
Problem occurred during compilation with the command line below:
/cm/shared/sw/pkg/devel/gcc/7.4.0/bin/g++ -shared -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -march=broadwell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=35840 -mtune=broadwell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -I/cm/shared/sw/pkg/devel/python3/3.7.3/lib/python3.7/site-packages/numpy/core/include -I/cm/shared/sw/pkg/devel/python3/3.7.3/include/python3.7m -I/mnt/home/apricewhelan/.local/lib/python3.7/site-packages/Theano-1.0.4+51.gf1e4ec4-py3.7.egg/theano/gof/c_code -L/cm/shared/sw/pkg/devel/python3/3.7.3/lib -fvisibility=hidden -o /mnt/home/apricewhelan/.theano/1714928/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.8.2003-Core-x86_64-3.7.3-64/tmp6miwlfs0/mf0c597995cda5a5ddd6f2023fd9404c7f21824ed4f7f0d5318dee29bd7fd2b7c.so /mnt/home/apricewhelan/.theano/1714928/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.8.2003-Core-x86_64-3.7.3-64/tmp6miwlfs0/mod.cpp -lpython3.7m
ERROR (theano.gof.cmodule): [Errno 14] Bad address: '/cm/shared/sw/pkg/devel/gcc/7.4.0/bin/g++'
```
It has been extremely frustrating and nearly impossible to debug because it only happens on some workers, and for some reason it seems to go away for an indeterminate amount of time if I use a fresh conda installation of Python and dependencies instead of the cluster-installed versions. But after a few test runs, it starts to fail with the same error even in that environment. Neither I nor our HPC staff can figure out what is causing this, and I haven't found anyone else with similar issues, so it must be something about the way our cluster is configured(?).
All of that to say: I'm stumped on this, so I want to find a workaround that gets the code running. One thing I've been trying to figure out is whether there is a way to precompile everything on the main process before pickling and sending the model out to each worker process (Theano seems to be happy to compile on the main process). Is this something that would be doable with pymc3? That is, can I precompile the `logp` and `dlogp` functions and tell `pm.sample()` to use the precompiled versions? Or does it have to recompile each time `pm.set_data()` is called?