PyMC3 fast_sample_posterior_predictive gets stuck on a multi-node cluster

I am running a PyMC3 model many times in a loop to estimate posterior predictive distributions. In every iteration I change the data using pm.set_data:

    with model:
        pm.set_data({
            "gw_pump_semi": pump,
            "gw_pump_semi_lag": pump_lag,
            "id_wtr_yr_lag": [wtr_yr_lag] * 2,
            "id_wtr_yr": [wtr_yr] * 2,
        })

        p_post = pm.fast_sample_posterior_predictive(
            trace=gwtrace,
            samples=400,
            random_seed=800,
            var_names=["depth_like"],
        )["depth_like"]

Before I start the parallel processes, I define the pm.Model() as model and load the trace that I estimated beforehand. Each parallel process then uses that model and trace to call pm.fast_sample_posterior_predictive. It works perfectly at first, but after a couple of hundred iterations it gets slower and eventually stops. I suspected a memory leak and tried the suggestions in https://github.com/pymc-devs/pymc/issues/1959, using multiprocessing inside the function.

But this is still happening, and I really need help! This is for my PhD research, and I am trying to run this function close to a million times.

Pymc3 = 3.11.2
theano-pymc=1.1.2
python 3.9.7

Installed using conda install -c conda-forge pymc3 theano-pymc mkl mkl-service
The cluster runs Linux.

This was solved using something similar to the multiprocessing suggestion, but importing the trace and defining the model in each iteration.
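For anyone landing here later, a minimal sketch of that kind of workaround is below. It is not the exact code used in this thread. It assumes Linux (it uses the "fork" start method), and the names `run_in_fresh_process`, `predict_once`, `build_model`, and `trace_path` are hypothetical: `build_model` stands for whatever function constructs your `pm.Model`, and `trace_path` for the file holding the pickled trace. Each call runs in a short-lived child process, so any memory it accumulates is released when the child exits.

```python
import multiprocessing as mp
import pickle


def _child(conn, func, args):
    """Run func(*args) in the child and send the outcome back over the pipe."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:  # report failures instead of leaving the parent waiting
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_in_fresh_process(func, *args):
    """Call func(*args) in a short-lived child process and return its result."""
    ctx = mp.get_context("fork")  # assumption: Linux, as on the cluster
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(send_end, func, args))
    proc.start()
    send_end.close()  # the parent keeps only the receiving end
    try:
        # Read before joining so a large result cannot block the child.
        status, payload = recv_end.recv()
    finally:
        proc.join()
    if status == "error":
        raise RuntimeError(payload)
    return payload


def predict_once(build_model, trace_path, pump, pump_lag, wtr_yr, wtr_yr_lag):
    """One iteration: define the model, load the trace, sample, return an array.

    build_model and trace_path are hypothetical placeholders (see above).
    """
    import pymc3 as pm  # imported lazily, only where the sampling happens

    model = build_model()
    with open(trace_path, "rb") as fh:
        gwtrace = pickle.load(fh)

    with model:
        pm.set_data({
            "gw_pump_semi": pump,
            "gw_pump_semi_lag": pump_lag,
            "id_wtr_yr_lag": [wtr_yr_lag] * 2,
            "id_wtr_yr": [wtr_yr] * 2,
        })
        ppc = pm.fast_sample_posterior_predictive(
            trace=gwtrace,
            samples=400,
            random_seed=800,
            var_names=["depth_like"],
        )
    return ppc["depth_like"]
```

Inside the loop, one iteration would then look like `p_post = run_in_fresh_process(predict_once, build_model, trace_path, pump, pump_lag, wtr_yr, wtr_yr_lag)`. The trade-off is that the model is rebuilt and the trace reloaded on every call, which adds overhead per iteration in exchange for a clean process each time.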

Out of curiosity, does this also happen with 3.11.4? And how are you loading the trace?

I didn’t try with version 3.11.4. As for the trace, I pickled it once, and in each iteration I load it with pickle.

Would you be willing to test this behaviour with v4? It looks like an interesting real-world test case.

Basically everything about saving traces and posterior predictive sampling has changed, and I think you won’t have these issues anymore. Depending on how complex your model is, it might be a good idea to wait until 4.0; otherwise 4.beta3 (which will hopefully be released soon) should be enough.

As an example of the changes, sampling now returns InferenceData, which you can save as netCDF or Zarr instead of depending on pickle (you can already do that in v3, but it’s opt-in).