Dataset-size-dependent EOFError

Hello

I’m running the following model:

import numpy as np
import pymc as pm
import pytensor.tensor as tt

# simulate some data
np.random.seed(42)

n = 10_000  # number of individuals
group_size = n // 2

t = 12  # number of time intervals

# event probabilities for group 1 and 2
p1 = 0.01
p2 = 0.05


def simulate_group(n, p):
    return np.random.choice((0, 1), size=(n, t), p=(1 - p, p))


# observed data
observed = np.vstack((simulate_group(group_size, p1), simulate_group(group_size, p2)))

# group indicators
group = np.hstack((np.zeros(group_size), np.ones(group_size)))


with pm.Model(coords=dict(intervals=range(t))):
    lam0 = pm.Gamma(
        "lam0",
        mu=pm.Gamma("lam0_mu", mu=0.5, sigma=0.5),
        sigma=pm.Gamma("lam0_sigma", mu=0.5, sigma=0.1),
        dims="intervals",
    )
    beta = pm.Normal("beta", mu=0, sigma=1)
    lam = tt.outer(tt.exp(beta * group), lam0)
    pm.Poisson("obs", lam, observed=observed)
    pm.sample()

Which generates the following error:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [lam0_mu, lam0_sigma, lam0, beta]
Sampling 4 chains, 0 divergences |-----------------------------------| 0.00% [0/8000 00:00<?]
Traceback (most recent call last):
  File "<path>/test.py", line 38, in <module>
    pm.sample()
  File "<path>/lib/python3.10/site-packages/pymc/sampling/mcmc.py", line 666, in sample
    _mp_sample(**sample_args, **parallel_args)
  File "<path>/lib/python3.10/site-packages/pymc/sampling/mcmc.py", line 1055, in _mp_sample
    for draw in sampler:
  File "<path>/lib/python3.10/site-packages/pymc/sampling/parallel.py", line 448, in __iter__
    draw = ProcessAdapter.recv_draw(self._active)
  File "<path>/lib/python3.10/site-packages/pymc/sampling/parallel.py", line 320, in recv_draw
    msg = ready[0].recv()
  File "<path>/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "<path>/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "<path>/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError

Setting n to something lower (e.g. 1000) resolves the error, and the model samples as expected.

Is there a way to run this with larger n?

Thanks
David

I’m using pymc 5.1.1

Can you try with the latest version of PyMC?

Same behaviour on pymc 5.3.0.

I'm getting the same issue with a dataset at work on pymc 5.3.1 (M1 Mac).

I start getting the error consistently when the dataset size is N=2235, but it consistently works for N=2234. Unfortunately I can't share reproducible code for that one, but here's the model block:

# coords, i (product index), h (hour index), X (size design matrix) and data
# are defined elsewhere in the notebook and not shown here
with pm.Model(coords=coords) as m_sku:
    alpha = pm.Normal("alpha", 0, 1, dims="product")
    hour_effect = pm.Normal("beta", 0, 1, dims="hour")
    size_effect = pm.Normal("Bsize", 0, 0.5, dims="size")
    lambd = pm.Deterministic(
        "lambd", pm.math.exp(alpha[i] + hour_effect[h] + size_effect @ X.T)
    )

    dist = pm.Poisson.dist(mu=lambd)
    obs = pm.Censored(
        "obs", dist, lower=None, upper=data.available.values, observed=data.rentals
    )
    idata2 = pm.sample(idata_kwargs={"log_likelihood": True})

I then realized that my design matrix was unidentifiable, so I dropped a dummy column; after that, the maximum dataset size I could go up to without an EOFError was N=2457 (even if I randomly sampled the dataset, it seems to be a consistent limit).
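For context, dropping the reference level looked roughly like this (just a sketch; the column name size_category is made up, my real code differs):

import pandas as pd

# drop_first=True removes the reference level, so the dummy columns plus an
# intercept stay identifiable
X = pd.get_dummies(data["size_category"], drop_first=True).to_numpy(dtype=float)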

Error:

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
/var/folders/p4/f1r08vbs6lqbl816qzghkwtr0000gp/T/ipykernel_35693/2207676018.py in <cell line: 19>()
     27     cens = np.where(data.available <= data.rentals, data.available, np.inf)
     28     obs = pm.Censored("obs", dist, lower=0, upper=cens, observed=data.rentals)
---> 29     idata2 = pm.sample(idata_kwargs={"log_likelihood": True})

~/.pyenv/versions/3.9.7/envs/local_env/lib/python3.9/site-packages/pymc/sampling/mcmc.py in sample(draws, tune, chains, cores, random_seed, progressbar, step, nuts_sampler, initvals, init, jitter_max_retries, n_init, trace, discard_tuned_samples, compute_convergence_checks, keep_warning_stat, return_inferencedata, idata_kwargs, nuts_sampler_kwargs, callback, mp_ctx, model, **kwargs)
    675         _print_step_hierarchy(step)
    676         try:
--> 677             _mp_sample(**sample_args, **parallel_args)
    678         except pickle.PickleError:
    679             _log.warning("Could not pickle model, sampling singlethreaded.")

~/.pyenv/versions/3.9.7/envs/local_env/lib/python3.9/site-packages/pymc/sampling/mcmc.py in _mp_sample(draws, tune, step, chains, cores, random_seed, start, progressbar, traces, model, callback, mp_ctx, **kwargs)
   1064         try:
   1065             with sampler:
-> 1066                 for draw in sampler:
   1067                     strace = traces[draw.chain]
   1068                     strace.record(draw.point, draw.stats)

~/.pyenv/versions/3.9.7/envs/local_env/lib/python3.9/site-packages/pymc/sampling/parallel.py in __iter__(self)
    446 
    447         while self._active:
--> 448             draw = ProcessAdapter.recv_draw(self._active)
    449             proc, is_last, draw, tuning, stats = draw
    450             self._total_draws += 1

~/.pyenv/versions/3.9.7/envs/local_env/lib/python3.9/site-packages/pymc/sampling/parallel.py in recv_draw(processes, timeout)
    318         idxs = {id(proc._msg_pipe): proc for proc in processes}
    319         proc = idxs[id(ready[0])]
--> 320         msg = ready[0].recv()
    321 
    322         if msg[0] == "error":

~/.pyenv/versions/3.9.7/lib/python3.9/multiprocessing/connection.py in recv(self)
    253         self._check_closed()
    254         self._check_readable()
--> 255         buf = self._recv_bytes()
    256         return _ForkingPickler.loads(buf.getbuffer())
    257 

~/.pyenv/versions/3.9.7/lib/python3.9/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    417 
    418     def _recv_bytes(self, maxsize=None):
--> 419         buf = self._recv(4)
    420         size, = struct.unpack("!i", buf.getvalue())
    421         if size == -1:

~/.pyenv/versions/3.9.7/lib/python3.9/multiprocessing/connection.py in _recv(self, size, read)
    386             if n == 0:
    387                 if remaining == size:
--> 388                     raise EOFError
    389                 else:
    390                     raise OSError("got end of file during message")

EOFError: 

edit: setting pm.sample(..., cores=1) at least seems to work for me for now @davipatti
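i.e. a minimal sketch of the workaround (same model block as above, only the sample call changes):

with m_sku:
    # single-core sampling skips the multiprocessing pipe entirely
    idata2 = pm.sample(cores=1, idata_kwargs={"log_likelihood": True})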

Aha, let's dig this one up from the past… because right now I'm getting the exact same issue. I debugged all the way down to the same region of code (multiprocessing/connection.py:399) and still have no idea how this would be dataset-size dependent.

My PyMC model is very different from the above, and I expect the versions of everything are different too. I first saw this using an env based around macOS 14.7, Python 3.11 & PyMC v5.16, then thought this would be a good time to upgrade everything anyway, and installed macOS 15.2 and a new env with Python 3.12, PyMC 5.20, etc. All bang up to date.

My only guess is it might be something to do with available RAM, but iStats / htop show that there's RAM available at the time, so maybe it's an addressing problem or an object-size limit somewhere inside multiprocessing? Something maybe to do with multiprocessing — Process-based parallelism — Python 3.13.1 documentation?
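If I wanted to test the object-size idea directly, I suppose a crude probe would be to push a large array through a bare multiprocessing pipe and see whether it survives (only a sketch of the experiment, nothing from the pymc codebase):

import multiprocessing as mp
import numpy as np

def child(conn):
    # pickle and send a large array back to the parent; ~2.4 GB of float64,
    # deliberately bigger than 2 GiB
    conn.send(np.zeros(300_000_000))
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=child, args=(child_conn,))
    proc.start()
    print(parent_conn.recv().nbytes)  # recv raises EOFError if the child dies
    proc.join()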

A lousy “fix” is to avoid using multiprocessing by setting cores=1.

FWIW this is the general summary of my current machine / stack

MacOS 15.2 (Sequoia), Macbook Air M2 24GB RAM

$> xcode-select -v
xcode-select version 2409.

Issue is seen when using any typical IDE:

  • VSCode v1.96.4 (Jupyter extension ms-toolsai.jupyter v2024.10.0),
  • straight Jupyter Lab launched from terminal
  • straight Jupyter Notebook launched from terminal

Environment is fairly normal (I hope), installed via conda-forge

Python implementation: CPython
Python version       : 3.12.8
IPython version      : 8.31.0

ipykernel  : 6.29.5
pymc    : 5.20.0
pytensor: 2.26.4

Compiler    : Clang 18.1.8 
OS          : Darwin
Release     : 24.2.0
Machine     : arm64
Processor   : arm
CPU cores   : 8
Architecture: 64bit

sys       : 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:19:53) [Clang 18.1.8 ]

I think (hope!) I'm using Accelerate and Clang 18.1.8

$ > mamba list

...
libblas                             3.9.0           26_osxarm64_accelerate        conda-forge
libcblas                            3.9.0           26_osxarm64_accelerate        conda-forge
liblapack                           3.9.0           26_osxarm64_accelerate        conda-forge
...
libclang-cpp18.1                    18.1.8          default_h5c12605_5            conda-forge
...

Although import numpy as np; np.__config__.show() yields references to clang 16.0.6, which bothers me a little:

Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /Users/jon/miniforge/envs/vulcan/include
    lib directory: /Users/jon/miniforge/envs/vulcan/lib
    name: blas
    openblas configuration: unknown
    pc file directory: /Users/jon/miniforge/envs/vulcan/lib/pkgconfig
    version: 3.9.0
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4569863840
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem,
      /Users/jon/miniforge/envs/vulcan/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/jon/miniforge/envs/vulcan=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/jon/miniforge/envs/vulcan/include, -mmacosx-version-min=11.0
    commands: arm64-apple-darwin20.0.0-clang
    linker: ld64
    linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/jon/miniforge/envs/vulcan/lib,
      -L/Users/jon/miniforge/envs/vulcan/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong,
      -O2, -pipe, -isystem, /Users/jon/miniforge/envs/vulcan/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/jon/miniforge/envs/vulcan=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/jon/miniforge/envs/vulcan/include, -mmacosx-version-min=11.0
    name: clang
    version: 16.0.6
  c++:
    args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++,
      -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/jon/miniforge/envs/vulcan/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/jon/miniforge/envs/vulcan=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/jon/miniforge/envs/vulcan/include, -mmacosx-version-min=11.0
    commands: arm64-apple-darwin20.0.0-clang++
    linker: ld64
    linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/jon/miniforge/envs/vulcan/lib,
      -L/Users/jon/miniforge/envs/vulcan/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong,
      -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0,
      -isystem, /Users/jon/miniforge/envs/vulcan/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/jon/miniforge/envs/vulcan=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/jon/miniforge/envs/vulcan/include, -mmacosx-version-min=11.0
    name: clang
    version: 16.0.6
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.8
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  cross-compiled: true
  host:
    cpu: arm64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/jon/miniforge/envs/vulcan/bin/python
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

Just for fun, the relevant part of the stack trace:

File ~/miniforge/envs/vulcan/lib/python3.12/site-packages/pymc/sampling/mcmc.py:906, in sample(draws, tune, chains, cores, random_seed, progressbar, progressbar_theme, step, var_names, nuts_sampler, initvals, init, jitter_max_retries, n_init, trace, discard_tuned_samples, compute_convergence_checks, keep_warning_stat, return_inferencedata, idata_kwargs, nuts_sampler_kwargs, callback, mp_ctx, blas_cores, model, compile_kwargs, **kwargs)
    904 _print_step_hierarchy(step)
    905 try:
--> 906     _mp_sample(**sample_args, **parallel_args)
    907 except pickle.PickleError:
    908     _log.warning("Could not pickle model, sampling singlethreaded.")

File ~/miniforge/envs/vulcan/lib/python3.12/site-packages/pymc/sampling/mcmc.py:1318, in _mp_sample(draws, tune, step, chains, cores, rngs, start, progressbar, progressbar_theme, traces, model, callback, blas_cores, mp_ctx, **kwargs)
   1316 try:
   1317     with sampler:
-> 1318         for draw in sampler:
   1319             strace = traces[draw.chain]
   1320             strace.record(draw.point, draw.stats)

File ~/miniforge/envs/vulcan/lib/python3.12/site-packages/pymc/sampling/parallel.py:478, in ParallelSampler.__iter__(self)
    471 task = progress.add_task(
    472     self._desc.format(self),
    473     completed=self._completed_draws,
    474     total=self._total_draws,
    475 )
    477 while self._active:
--> 478     draw = ProcessAdapter.recv_draw(self._active)
    479     proc, is_last, draw, tuning, stats = draw
    480     self._completed_draws += 1

File ~/miniforge/envs/vulcan/lib/python3.12/site-packages/pymc/sampling/parallel.py:334, in ProcessAdapter.recv_draw(processes, timeout)
    332 idxs = {id(proc._msg_pipe): proc for proc in processes}
    333 proc = idxs[id(ready[0])]
--> 334 msg = ready[0].recv()
    336 if msg[0] == "error":
    337     old_error = msg[1]

File ~/miniforge/envs/vulcan/lib/python3.12/multiprocessing/connection.py:250, in _ConnectionBase.recv(self)
    248 self._check_closed()
    249 self._check_readable()
--> 250 buf = self._recv_bytes()
    251 return _ForkingPickler.loads(buf.getbuffer())

File ~/miniforge/envs/vulcan/lib/python3.12/multiprocessing/connection.py:430, in Connection._recv_bytes(self, maxsize)
    429 def _recv_bytes(self, maxsize=None):
--> 430     buf = self._recv(4)
    431     size, = struct.unpack("!i", buf.getvalue())
    432     if size == -1:

File ~/miniforge/envs/vulcan/lib/python3.12/multiprocessing/connection.py:399, in Connection._recv(self, size, read)
    397 if n == 0:
    398     if remaining == size:
--> 399         raise EOFError
    400     else:
    401         raise OSError("got end of file during message")

EOFError: 

When does it fail? Can you share details about the model size?

Do you have large deterministics? Does it fail if you don’t keep anything in the trace? Pass var_names=[].

Does it also fail if you compile to Numba? Pass compile_kwargs=dict(mode="NUMBA") to pm.sample.
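For example, a sketch against the reproducible model at the top of the thread (assuming that model context is bound to a name like model):

with model:
    # 1) keep nothing in the trace, to rule out oversized draws
    idata_a = pm.sample(var_names=[])

    # 2) compile the model functions with Numba instead of the C backend
    idata_b = pm.sample(compile_kwargs=dict(mode="NUMBA"))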


Thanks @ricardoV94 - good to have you so on the ball!

This is a failure at the start of sampling; it gets to the stage of showing this and then falls over:

Sampling 4 chains, 0 divergences ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- / 0:00:00

What do you mean by model size (an actual metric, or something else)? The dataset size is actually very small: in the low hundreds of observations, around n=200.

It's relatively complicated, but not more than approximately 8 RVs. Interesting that you mention Deterministics: I have 3 models with a similar core architecture and increasing complexity. The first model (call it B0) actually runs fine on this slightly larger dataset. The failures begin at the second model, B1, which (well anticipated by you) does include 2 new Deterministics and Potentials…

I’m trying a few things atm.

  1. I've messed about with some nuts_sampler_kwargs that seem to relate to data size n, but the issue still occurs:

nuts_sampler_kwargs=dict(
    max_treedepth = ?,
    early_max_treedepth = ?,
    step_scale = ?
),

  2. I tried var_names=[] per your suggestion, but the issue still occurs. Also, this issue seems to happen before any traces are made, so I don't think that would be the cause?

  3. I'll try your Numba suggestion and report back. BTW, do you mean to also use nutpie at the same time?

Also, I have another model, A2, which is similar to B1 but slightly simpler along different lines, and I can share that one publicly. That never failed before, but now I'll try to stress test it and see if I can make A2 fail in the same way…

No, I mean to use the PyMC sampler, just compiling stuff to Numba.

Gotcha, thanks!

I'd also play around with the mp_ctx argument just to rule that out. I think the default is 'fork', but you can try 'spawn'.


Thanks, I'll give it a go, though apparently 'spawn' is currently the default for macOS multiprocessing (multiprocessing — Process-based parallelism — Python 3.13.1 documentation).

I stand corrected: 'spawn' is indeed the general default for multiprocessing on macOS (fairly recently switched over from fork), but in PyMC the default for Apple Silicon (or rather, Darwin & arm) is set to 'fork': pymc/pymc/sampling/parallel.py at fa43eba8d682fbf4039bbd1c228c940c872e2d5d · pymc-devs/pymc · GitHub
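To see the mismatch for yourself, a quick sketch (multiprocessing.get_start_method() is standard library; mp_ctx is the pm.sample argument mentioned above):

import multiprocessing
import pymc as pm

# Python's own platform default: 'spawn' on current macOS / Python versions
print(multiprocessing.get_start_method())

with model:  # whatever model you are sampling
    # force spawn instead of PyMC's Darwin/arm default of fork
    idata = pm.sample(mp_ctx="spawn")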

Well how about that… setting pm.sample(..., mp_ctx='spawn') lets my more complicated model B1 run perfectly - thanks @jessegrabowski for the idea

There are a couple of URLs in the comments inside pymc.sampling.parallel.ParallelSampler linking to GitHub discussions from 2020 and 2022, and in those a further link to the PR Set start method to fork for MacOs ARM devices by bchen93 · Pull Request #6218 · pymc-devs/pymc · GitHub

Perhaps that code can be undone now that we’re a few years down the line?


I suspect that’s one of those that flip-flops over the years as someone finds one method works on their machine and the other does not.


Almost certainly! The changeover to Apple Silicon etc. hasn't been without a few bumps :smiley: