How to choose the number of CPUs and the memory size for Bayesian models using PyMC

Hi PyMC community,

I am working on a Bayesian modelling project using PyMC 5 and nutpie 0.5.1, and I have some findings and questions to share:

  1. Environment: I created a conda environment with Python 3.10.9 and nutpie 0.5.1 (the latest version on conda-forge as of March 2023) on a Linux virtual machine instance. Without pinning the PyMC version, installing nutpie 0.5.1 automatically pulls in PyMC 5.0.2, and if I install PyMC >= 5.1.x first, installing nutpie 0.5.1 via conda downgrades PyMC back to 5.0.2. I am fine with this combination of PyMC and nutpie.

  2. Chains and CPUs: pm.sample() lets users choose the number of chains and the number of cores used during sampling. My understanding is that each chain can be processed by at most one CPU, and that the maximum number of cores PyMC can leverage is 4. If that is right, I would need a virtual machine instance with at most 4 CPUs to sample 4 chains, and choosing 8 CPUs would waste resources. What about 8 chains, then? Does using 8 chains help provide more reliable estimates? (A minimal sketch of how I call pm.sample() is included after this list.)

  3. Memory size: Does PyMC use a large amount of RAM during sampling? Should I increase the amount of RAM as I increase the number of draws and tuning steps? (A rough back-of-envelope estimate is sketched after this list.)

  4. Kernel crashes: My Python kernel dies when I estimate a hierarchical Bayesian model with a large number of parameters. For example, I am tweaking a two-layer hierarchical Bayesian model with 1700+ observations (rows in my data frame), 30+ variables (columns), and 300+ parameters (hyperparameters and parameters combined). The kernel dies when I run pm.sample(draws=1000, tune=1000, chains=2, cores=4). It might survive if I switch from a machine instance with 4 CPUs and 16 GB of memory to one with 72 CPUs and 100+ GB of memory. Any thoughts? (I have also sketched a nutpie-based attempt after this list.)
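
For question 2, here is a minimal sketch of how I currently call pm.sample(). The toy model and the specific numbers of chains and cores are just placeholders, not my real model:

```python
import pymc as pm

# Toy model, only to illustrate the chains/cores settings in question 2.
with pm.Model() as toy_model:
    mu = pm.Normal("mu", 0, 1)
    pm.Normal("obs", mu, 1, observed=[0.1, -0.3, 0.2])

    # My understanding: each chain runs on one core, so cores beyond the
    # number of chains would sit idle, and chains beyond the number of
    # cores would have to be queued.
    idata = pm.sample(draws=1000, tune=1000, chains=4, cores=4)
```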
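
For question 3, this is the back-of-envelope calculation I have been using to guess at the size of the posterior draws in memory; the counts below are rough assumptions based on my model, not measured numbers:

```python
# Rough estimate of the in-memory size of the retained posterior draws,
# assuming float64 (8 bytes) per value. Sampler stats and any retained
# tuning draws would add to this.
n_chains, n_draws, n_params = 4, 1000, 300
trace_bytes = n_chains * n_draws * n_params * 8
print(f"posterior draws alone: ~{trace_bytes / 1e6:.1f} MB")  # ~9.6 MB
```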
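
For question 4, since I already have nutpie installed, I have also been considering calling it directly instead of pm.sample(). Below is a sketch on a toy two-level hierarchical model, assuming I am reading the nutpie 0.5 API (compile_pymc_model / sample) correctly; the shapes are placeholders, not my actual 1700 x 30 data set:

```python
import numpy as np
import pymc as pm
import nutpie

# Toy two-level hierarchical model standing in for my real one.
rng = np.random.default_rng(0)
n_groups, n_obs = 10, 200
group_idx = rng.integers(n_groups, size=n_obs)
y = rng.normal(size=n_obs)

with pm.Model() as hier_model:
    mu = pm.Normal("mu", 0, 1)
    sigma_g = pm.HalfNormal("sigma_g", 1)
    group_eff = pm.Normal("group_eff", mu, sigma_g, shape=n_groups)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", group_eff[group_idx], sigma, observed=y)

# nutpie compiles the log density once and, as far as I understand,
# runs the chains in parallel threads rather than separate processes.
compiled = nutpie.compile_pymc_model(hier_model)
idata = nutpie.sample(compiled, draws=1000, tune=1000, chains=2)
```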

Hopefully the discussion in this post will help me clear up these questions. Thanks in advance for any insights!