Sample with multiple cores

PyMC novice here. I have a fairly simple (I think) model with 4 RVs, two predictor variables, about 200 observations:

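import pymc3 as pm

# dn and roll are the two predictor arrays; cases holds the observed counts
# (about 200 observations each).
with pm.Model() as mod:
    # Priors
    alpha = pm.Exponential("alpha_0", 1.0)
    beta = pm.Lognormal("beta", 1.0, 1.0)
    nba = pm.Lognormal("nba", 1.0, 1.0)

    e_dn = dn  # not used below

    # Expected count: exponential decay in dn, scaled by roll
    inf_exp = beta * pm.math.exp(-alpha * dn) * roll

    # Negative binomial likelihood with dispersion nba
    inf_obs = pm.NegativeBinomial("inf_obs", mu=inf_exp, alpha=nba, observed=cases)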

(This is the basic version of a more complex model I would like to run, so if there are potential scaling issues for a larger model I would like to know too!)

When I try to run sample with:

with mod:
    trace = pm.sample(100000, tune=50000, cores=ncores)

  • If ncores = 1, it runs okay.
  • If 1 < ncores <= 8, it takes increasingly long to initialise, and only sometimes eventually runs.
  • If ncores > 8, it basically never finishes initialising and I have to restart the Jupyter kernel.

I’m running on a research computing cluster which is a Windows 10 Enterprise virtual machine with 48 cores. I need to run many (~200) separate iterations of this model, or ideally a more complex version of it, so would like to take full advantage of available computing resources to run it as fast as possible.

Based on these other answers, it seems there’s no straightforward way on Windows 10 to parallelise sample, even with lots of cores. Is that right? Any suggestions or solutions welcome.

Apparently there have been a few issues with parallel processing on Windows recently. See here, and maybe confirm you’re using the latest version.

Just updated to v3.9.3 from v3.8, but still having the same issue…

FWIW I’m not getting any runtime errors or anything. Instead I get:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...

And then after a bit longer (couple minutes usually):

Multiprocess sampling (32 chains in 32 jobs)
NUTS: [nba, beta, alpha_0]

And then… nothing happens.
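
For scale: the log line above reports 32 chains in 32 jobs, which matches PyMC3's documented default of chains = max(2, cores) when chains isn't passed to pm.sample. A rough sketch of what that implies for the call above (total_iterations is a hypothetical helper for the arithmetic, not part of PyMC3):

```python
def total_iterations(draws, tune, cores, chains=None):
    """Total sampler iterations across all chains for one pm.sample call."""
    # Assumption: mirrors the PyMC3 default, chains = max(2, cores) when unset.
    if chains is None:
        chains = max(2, cores)
    return chains * (draws + tune)


# The call from the question with cores=32 and chains left unset:
print(total_iterations(100000, 50000, cores=32))
# The same call with the chain count pinned to 4:
print(total_iterations(100000, 50000, cores=32, chains=4))
```

If that default is what is happening here, passing chains explicitly (for example pm.sample(..., cores=4, chains=4)) should keep the number of worker processes and the total amount of sampling fixed, regardless of how many cores the machine has. Whether that also fixes the hang on Windows is a separate question.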

With further testing, it seems I can more or less consistently get it to run with low numbers of cores (ncores <= 8, sometimes <= 16).

With that said, is there maybe some workaround way to efficiently use the available cores to run this? Could I for instance use multiprocessing as part of my overall workflow to try to run sample multiple times concurrently on separate iterations of the model, or…?
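
One way the idea in that last question could look: treat each of the ~200 model fits as its own single-core task and let a process pool spread the tasks over the machine. This is only a sketch under assumptions: fit_one and run_all are hypothetical names, each dataset is assumed to be a (dn, roll, cases) tuple, and the draw and chain counts are placeholders. On Windows, worker processes are started by spawning a fresh interpreter, so the worker function has to live in an importable .py module (not a notebook cell) and the pool has to be created under an if __name__ == "__main__": guard.

```python
from concurrent.futures import ProcessPoolExecutor


def fit_one(dataset):
    """Fit the model to one (dn, roll, cases) dataset using a single core."""
    import pymc3 as pm  # imported here so each worker process loads it itself

    dn, roll, cases = dataset
    with pm.Model():
        alpha = pm.Exponential("alpha_0", 1.0)
        beta = pm.Lognormal("beta", 1.0, 1.0)
        nba = pm.Lognormal("nba", 1.0, 1.0)
        inf_exp = beta * pm.math.exp(-alpha * dn) * roll
        pm.NegativeBinomial("inf_obs", mu=inf_exp, alpha=nba, observed=cases)
        # cores=1 keeps PyMC3 from starting its own worker processes;
        # the draw, tune and chain counts are placeholder values.
        trace = pm.sample(2000, tune=1000, chains=2, cores=1, progressbar=False)
    # Return plain arrays, which are simple to send back to the parent process.
    return {name: trace[name] for name in ("alpha_0", "beta", "nba")}


def run_all(datasets, worker=fit_one, n_procs=4):
    """Apply worker to every dataset, preserving input order."""
    if n_procs == 1:
        # Serial path, handy for debugging a single fit in-process.
        return [worker(d) for d in datasets]
    with ProcessPoolExecutor(max_workers=n_procs) as pool:
        return list(pool.map(worker, datasets))


if __name__ == "__main__":
    datasets = []  # placeholder: fill with the real (dn, roll, cases) tuples
    results = run_all(datasets, n_procs=16)
    print(len(results), "fits finished")
```

The design choice here is to parallelise across model fits rather than across chains, so the multiprocess path inside pm.sample that seems to hang is never used. Running with n_procs=1 first is a cheap way to check that a single fit works before scaling up.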