Running Pymc3 fast posterior sample in a multi-node cluster getting stuck

josemrodriguezf · February 6, 2022, 1:47am

I am running a PyMC3 model many times in a loop to estimate posterior predictive distributions. In every iteration I change the data using pm.set_data:

    with model:
        pm.set_data({"gw_pump_semi": pump,
                     "gw_pump_semi_lag": pump_lag,
                     "id_wtr_yr_lag": [wtr_yr_lag] * 2,
                     "id_wtr_yr": [wtr_yr] * 2})

        p_post = pm.fast_sample_posterior_predictive(
            trace=gwtrace, samples=400, random_seed=800,
            var_names=["depth_like"])["depth_like"]

Before I start the parallel computation, I define the pm.Model() as model and load the trace that I estimated beforehand. Each parallel process then calls the model and uses the trace with pm.fast_sample_posterior_predictive. It works perfectly at first, but after a couple of hundred iterations it gets slower and eventually stops. I suspected a memory leak and tried the suggestions in https://github.com/pymc-devs/pymc/issues/1959, using multiprocessing inside the function.

But this is still happening, and I really need help! It is for my PhD research, and I am trying to run this function close to a million times.

Pymc3 = 3.11.2
theano-pymc=1.1.2
python 3.9.7

Installed using conda install -c conda-forge pymc3 theano-pymc mkl mkl-service
The cluster runs Linux.

josemrodriguezf · February 7, 2022, 6:34pm

This was solved using something similar to the multiprocessing suggestion, but importing the trace and defining the model in each iteration.
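
For readers who want to try the same approach, here is a minimal sketch of the pattern described above, not the poster's actual code. It runs each iteration in a short-lived child process, so whatever memory the iteration accumulates is returned to the operating system when the child exits. The names `run_isolated` and `_worker` are hypothetical, the arithmetic on the trace is only a stand-in for building the model and calling pm.set_data and pm.fast_sample_posterior_predictive, and the "fork" start method is assumed to be available (it is on Linux).

```python
import multiprocessing as mp
import pickle


def _worker(trace_path, inputs, conn):
    """Runs inside the child process; all heavy objects are created here."""
    # Load the pickled trace in every iteration, as described in the post.
    with open(trace_path, "rb") as fh:
        trace = pickle.load(fh)

    # Placeholder (assumption): in the real workflow, define the pm.Model
    # here, call pm.set_data with `inputs`, and then call
    # pm.fast_sample_posterior_predictive. The list comprehension below
    # only stands in for that step.
    result = [inputs["pump"] * draw for draw in trace]

    conn.send(result)
    conn.close()


def run_isolated(trace_path, inputs):
    """Run one iteration in a fresh process and return its result."""
    ctx = mp.get_context("fork")  # assumption: Linux-style fork is available
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(trace_path, inputs, child_conn))
    proc.start()
    child_conn.close()  # the parent keeps only the receiving end
    result = parent_conn.recv()  # raises EOFError if the child died early
    proc.join()  # the child's memory is released once it exits
    return result
```

Calling `run_isolated("trace.pkl", {"pump": 2.0})` once per iteration keeps the parent process small, at the cost of reloading the trace and rebuilding the model every time.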

OriolAbril · February 8, 2022, 12:48am

Out of curiosity, does this also happen with 3.11.4? And how are you loading the trace?

josemrodriguezf · February 8, 2022, 1:06am

I didn’t try version 3.11.4. I pickled the trace first, and in each iteration I use pickle to load it.

OriolAbril · February 8, 2022, 1:45pm

Would you be willing to test this behaviour with v4? It looks like an interesting real-world test case.

Basically everything about saving traces and posterior predictive samples has changed, and I think you won’t have these issues anymore. Depending on how complex your model is, it might be a good idea to wait until 4.0; otherwise 4.beta3 (which will hopefully be released soon) should be enough.

As an example of the changes, sampling now returns InferenceData, which you can save as netCDF or Zarr instead of depending on pickle (you can already do that in v3, but it’s opt-in).
