ConnectionResetError when using multiprocessing with more than one core on Linux

This is my first time posting here, so let me know if I can improve my question!

I want to perform Bayesian inference for a model that is implemented in C++ (using OpenMP, if that matters). I wrapped the code with Cython and made it accessible from Python through the function cy_model:

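import pymc3 as pm
import pymc3.distributions.transforms as tr
import theano

# cy_model comes from my Cython extension (import not shown here)

# Chained transform for `a`: log-odds first, then ordered
Order = tr.Ordered()
Logodd = tr.LogOdds()
chain_tran = tr.Chain([Logodd, Order])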

with pm.Model() as astro_fit:

    M = pm.Normal("M", mu=160.5, sigma=28.6)
    T = pm.Normal("T", mu=1000.0, sigma=20.0)
    P = pm.Uniform("P", lower=10000.0, upper=500000.0)
    a = pm.Uniform("a", lower=0.0, upper=1.5, shape=3,
                   transform=chain_tran, testval=[0.3, 0.5, 0.8])

    steps = theano.shared(1e6)

    radius = cy_model(M, P, T, steps, a[0], a[1], a[2])

    r_obs = pm.Normal("r_obs", mu=radius,
                      sigma=1.14, observed=9.97)

    step = pm.Metropolis([M, P, T, a], blocked=True)

    trace = pm.sample(500, tune=200, init='adapt_diag', chains=4,
                      cores=4, step=step, return_inferencedata=False,
                      compute_convergence_checks=True)

The derivatives for cy_model are not available. Therefore, I am resorting to Metropolis.
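To illustrate why Metropolis fits this situation, here is a minimal plain-Python sketch of a random-walk Metropolis sampler. It is not PyMC3's implementation; `metropolis` and `toy_log_prob` are hypothetical names, and the toy target only borrows the observed value and sigma from the model above. The point is that the sampler only ever evaluates the log-probability and never asks for a gradient.

```python
import math
import random


def metropolis(log_prob, start, step_size, n_draws, seed=0):
    """Random-walk Metropolis for one scalar parameter (no gradients needed)."""
    rng = random.Random(seed)
    current = start
    current_lp = log_prob(current)
    draws = []
    for _ in range(n_draws):
        # Propose a move by adding Gaussian noise to the current value.
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        draws.append(current)
    return draws


def toy_log_prob(x):
    # Toy target (assumption): unnormalised normal centred on 9.97, sigma 1.14.
    return -0.5 * ((x - 9.97) / 1.14) ** 2


if __name__ == "__main__":
    draws = metropolis(toy_log_prob, start=9.0, step_size=1.0, n_draws=5000)
    kept = draws[500:]  # discard early draws as burn-in
    print(sum(kept) / len(kept))
```

In the real model every `log_prob` call would involve one cy_model evaluation, which is why each step is expensive.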

I am trying to run this on a Linux server through the job scheduler Slurm and get the following error after around 20 successfully completed steps:

Multiprocess sampling (4 chains in 4 jobs)
Metropolis: [a, T, P, M]
Traceback (most recent call last):
  File "/data/ms2054/MCMC-Astro/exec_MCMC_45b.py", line 126, in <module>
    trace = pm.sample(1000, tune=500, init='adapt_diag', chains=4,
  File "/data/ms2054/MCMC-Astro/env/lib/python3.9/site-packages/pymc3/sampling.py", line 559, in sample
    trace = _mp_sample(**sample_args, **parallel_args)
  File "/data/ms2054/MCMC-Astro/env/lib/python3.9/site-packages/pymc3/sampling.py", line 1477, in _mp_sample
    for draw in sampler:
  File "/data/ms2054/MCMC-Astro/env/lib/python3.9/site-packages/pymc3/parallel_sampling.py", line 479, in __iter__
    draw = ProcessAdapter.recv_draw(self._active)
  File "/data/ms2054/MCMC-Astro/env/lib/python3.9/site-packages/pymc3/parallel_sampling.py", line 351, in recv_draw
    msg = ready[0].recv()
  File "/data/ms2054/MCMC-Astro/env/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/data/ms2054/MCMC-Astro/env/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/data/ms2054/MCMC-Astro/env/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

This does not happen if I set cores=1.
I am quite new to PyMC3 and have no idea what could be causing this. The time per forward calculation via cy_model depends on the input parameters: some calculations take ~20 s, while others take around 1 minute. Could that cause issues?
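One way to narrow this down might be to look at how a worker process ends when it runs the forward model on its own, since the error is raised in the parent while it reads from a worker. The following is a minimal sketch using only the standard library, not something taken from PyMC3. `run_in_child`, `describe_exit`, `forward` and `crash` are hypothetical names; `forward` stands in for a single cy_model call, and `crash` simulates a worker being killed from outside (for example by a memory limit, which is only a guess here).

```python
import multiprocessing as mp
import os
import signal


def run_in_child(func, *args):
    """Run func(*args) in a forked child process and return its exit code."""
    ctx = mp.get_context("fork")  # the fork start method is available on Linux
    proc = ctx.Process(target=func, args=args)
    proc.start()
    proc.join()
    return proc.exitcode


def describe_exit(code):
    """Turn a multiprocessing exit code into a readable message."""
    if code == 0:
        return "exited normally"
    if code < 0:
        # Negative exit codes mean the child was terminated by a signal.
        return f"killed by signal {signal.Signals(-code).name}"
    return f"exited with status {code}"


def forward(scale):
    # Hypothetical stand-in for one call such as cy_model(...).
    return scale * 2.0


def crash():
    # Simulates a worker that is killed abruptly from outside.
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    print(describe_exit(run_in_child(forward, 1.0)))
    print(describe_exit(run_in_child(crash)))
```

If a child running the real model reports a signal such as SIGKILL or SIGSEGV, that would point at the process itself dying rather than at the varying run time.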

I am using:
python 3.9.4
pymc3 3.11.2
theano-pymc 1.1.2
all installed via conda.
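Since the model uses OpenMP and the sampler starts four worker processes, the number of threads per worker may also be relevant. This is a minimal sketch of how the thread count could be capped, assuming the Slurm job sets SLURM_CPUS_PER_TASK (it is only set when the job requests --cpus-per-task). `threads_per_chain` is a hypothetical helper, and I have not verified that oversubscription is related to the error.

```python
import os


def threads_per_chain(n_chains, environ=os.environ):
    """Split the CPUs granted to the job evenly across sampling chains."""
    # Use the Slurm allocation if present, otherwise fall back to os.cpu_count().
    raw = environ.get("SLURM_CPUS_PER_TASK")
    total = int(raw) if raw else (os.cpu_count() or 1)
    # Never go below one thread per chain.
    return max(1, total // n_chains)


if __name__ == "__main__":
    # OMP_NUM_THREADS usually has to be set before the OpenMP-based
    # extension is loaded, so this would go at the top of the script.
    os.environ["OMP_NUM_THREADS"] = str(threads_per_chain(n_chains=4))
    print(os.environ["OMP_NUM_THREADS"])
```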

I would be happy about any help!
