Kernel disconnects/crashes randomly when looping through models

Hi, I’m having some trouble with PyMC3. I have a model that depends on marginalizing over certain parameters, in such a way that it isn’t differentiable with respect to them (or at least, they didn’t play well with Theano when I tried, and I don’t know enough about Theano to attempt to fix it). So instead I create a whole bunch of different models in a loop and save them all to file for later aggregation and analysis. I know this isn’t exactly “best practice”, but it was all I could think to do.

My problem is that at random times the kernel just dies completely and cannot be restarted. This is a huge problem because the whole process typically takes many hours, and I can’t sit around babysitting it for when it breaks.

Here’s the raw data for the notebook (copy-paste and save as .ipynb), and here’s the data CSV it references. I’ve reproduced the problematic loop here:

import os
import pickle
from datetime import datetime

import pymc3 as pm

latency_mean = 12
latency_var = 5.433
latency_parameters = lognormal_parameters(latency_mean, latency_var)
stay_at_home = 1
saturated = False

n_samples = 500
n_tune = 1500
n_chains = 4

model_name = "model_sah_"+str(int(stay_at_home))+"_sat_"+str(int(saturated))+"_"
marginalization_mc = 100
for i in range(marginalization_mc):
    print(i)
    outbreak = Observed_Outbreak(patients_data)
    outbreak.impute_latencies(latency_parameters)
    outbreak.get_global_infectious()
    outbreak.get_susceptible_states()

    outbreak.get_group_infectious("Household_ID")
    outbreak.get_group_infectious("Family_ID")
    outbreak.get_group_infectious("Sex")
    outbreak.get_classroom_infectious(1, infectious_override=stay_at_home)
    outbreak.get_classroom_infectious(2, infectious_override=stay_at_home)

    community_slab = outbreak.slabs["Community"]
    c1_slab = outbreak.slabs["Classroom_1"]
    c2_slab = outbreak.slabs["Classroom_2"]
    household_slab = outbreak.slabs["Household_ID"]
    family_slab = outbreak.slabs["Family_ID"]
    sex_slab = outbreak.slabs["Sex"]
    s_before = outbreak.states_before
    s_after = outbreak.states_after

    this_model = pm.Model()
    with this_model:
        rv_q_community = pm.Uniform("rv_q_community")
        ##rv_q_class_0 = pm.Uniform("rv_q_class_0")
        rv_q_class_1 = pm.Uniform("rv_q_class_1")
        rv_q_class_2 = pm.Uniform("rv_q_class_2")
        rv_q_household = pm.Uniform("rv_q_household")
        rv_q_family = pm.Uniform("rv_q_family")
        rv_q_sex = pm.Uniform("rv_q_sex")

        probabilities = ((rv_q_community**community_slab) *
                         (rv_q_class_1**c1_slab) *
                         (rv_q_class_2**c2_slab) *
                         (rv_q_household**household_slab) *
                         (rv_q_family**family_slab) *
                         (rv_q_sex**sex_slab))

        ## probability is zero if already not susceptible
        probabilities = s_before * probabilities
        s_observed = pm.Bernoulli("s_observed", p=probabilities, observed=s_after)
        this_trace = pm.sample(draws=n_samples, cores=n_chains,
                               chains=n_chains, tune=n_tune)
        ##pm.traceplot(this_trace)
        ##print("Likelihood sampled using MCMC NUTS:")
        ##plt.show()
        print(pm.summary(this_trace))


    now = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    dump_filename = "models/"+model_name+now+".pkl"
    os.makedirs(os.path.dirname(dump_filename), exist_ok=True)
    model_parameters = {"latency_parameters": latency_parameters,
                        "stay_at_home": stay_at_home,
                        "saturated": saturated,
                        "n_samples": n_samples,
                        "n_tunes": n_tune,
                        "n_chains": n_chains}

    with open(dump_filename, "wb") as the_file:
        pickle.dump({"model_parameters": model_parameters,
                     "latencies": outbreak.data_frame["Latency_Onset"],
                     "model": this_model,
                     "trace": this_trace}, the_file)

    print(f"Model successfully dumped to file: {dump_filename}.")

print("Marginalization finished.")

I stress that ordinarily the model works absolutely fine – when I run this a single time outside of a loop, it always samples properly and I get sensible posterior distributions for the parameters. The problem comes when I loop it many times. Typically it breaks partway through sampling; e.g. here’s where it got to last time:

2
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [rv_q_sex, rv_q_family, rv_q_household, rv_q_class_2, rv_q_class_1, 
rv_q_community]
Sampling 4 chains:   0%|          | 14/8000 [00:01<24:22,  5.46draws/s]

After that, the kernel crashes. It doesn’t even give an error message. Sometimes it manages dozens of loops successfully; other times it breaks after the first few.

I’m using PyMC3 version 3.5, installed with “conda install -c conda-forge pymc3”. Here’s a full list of my installed packages. How do I fix this?

Since your model remains unchanged over different runs, I suggest you define the model once outside of the loop, use a theano.shared variable for your observations, and inside the for loop just update the shared variable with its .set_value() method.
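
A minimal, self-contained sketch of that pattern (toy data and a single rv_q stand in for your slabs and states; the shapes and values here are made up):

import numpy as np
import theano
import pymc3 as pm

n_obs = 50
# Shared containers for the data; the model graph is built against these once.
slab_shared = theano.shared(np.full(n_obs, 2.0))
s_before_shared = theano.shared(np.ones(n_obs))
s_after_shared = theano.shared(np.zeros(n_obs))

with pm.Model() as shared_model:
    rv_q = pm.Uniform("rv_q")
    probabilities = s_before_shared * rv_q ** slab_shared
    pm.Bernoulli("s_observed", p=probabilities, observed=s_after_shared)

for i in range(3):
    # Swap in this iteration's realisation without rebuilding the model.
    slab_shared.set_value(np.random.uniform(0.5, 3.0, size=n_obs))
    s_after_shared.set_value(np.random.binomial(1, 0.3, size=n_obs).astype("float64"))
    with shared_model:
        trace = pm.sample(draws=200, tune=500, chains=2)

In your case the shared variables would hold each loop’s slabs, states_before and states_after produced by impute_latencies.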

Thanks, I’ll try that and report back in a few hours …

Nope, this didn’t fix it. Broke after 18 loops:

(This is running locally, by the way, not over a remote connection.)

This was the last thing in the console before the crash:

18

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [rv_q_sex, rv_q_family, rv_q_household, rv_q_class_2, rv_q_class_1, rv_q_community]

This is what I used instead of remaking the model each time - did I do it wrong? Or is this a problem with Jupyter?

I tried doing it in IDLE, but it crashed with:

Model created.
0
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [rv_q_sex, rv_q_family, rv_q_household, rv_q_class_2, rv_q_class_1, rv_q_community]
Traceback (most recent call last):
  File "D:\JupyterNotebooks\measles_analysis_3.py", line 244, in <module>
    progressbar=False)
  File "D:\Anaconda3\lib\site-packages\pymc3\sampling.py", line 449, in sample
    trace = _mp_sample(**sample_args)
  File "D:\Anaconda3\lib\site-packages\pymc3\sampling.py", line 996, in _mp_sample
    chain, progressbar)
  File "D:\Anaconda3\lib\site-packages\pymc3\parallel_sampling.py", line 275, in __init__
    for chain, seed, start in zip(range(chains), seeds, start_points)
  File "D:\Anaconda3\lib\site-packages\pymc3\parallel_sampling.py", line 275, in <listcomp>
    for chain, seed, start in zip(range(chains), seeds, start_points)
  File "D:\Anaconda3\lib\site-packages\pymc3\parallel_sampling.py", line 182, in __init__
    self._process.start()
  File "D:\Anaconda3\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "D:\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "D:\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "D:\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

Same thing in Spyder.

At least it does save the models to file properly each time, so in theory I can drag this thing to the finish line by manually restarting it every time it breaks.

Hi, I tried running your Jupyter notebook, but there were some problems, maybe due to package versions. Anyway, I have faced a similar problem. One workaround might be to use MAP or other approximation methods. pm.sample() didn’t work for me quite a few times either, though I am not sure why.
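
For reference, a minimal sketch of getting a MAP point estimate in PyMC3 (a toy model; note the caveat in the next reply):

import pymc3 as pm

with pm.Model():
    rv_q = pm.Uniform("rv_q")
    pm.Bernoulli("obs", p=rv_q, observed=[0, 1, 1, 0, 1])
    # Optimizes the log-posterior; no sampling and no multiprocessing involved.
    map_estimate = pm.find_MAP()
print(map_estimate)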

If you care about the quality of your inference, you really should not be using MAP, especially if your model is complicated (linear regression is probably the only case where it is OK).

@dain - the hyperlink to the code is not pasted correctly. And you are sure it works with i = 18 when you run it outside of the loop, right? I think it might be a memory problem…

Oh gosh, sorry about the link. Here it is: https://pastebin.com/U60vALUw

As for your question: the i term doesn’t enter into the model itself; it’s just there so I can keep track of how many iterations of the loop have occurred. No iteration of the loop is systematically different from the others; the differences in the input data come from the different realisations of the “impute_latencies” method on the “Observed_Outbreak” class, called in each loop. The details are complicated, but basically they’re all i.i.d. draws from the same distribution.
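
To illustrate with a hypothetical stand-in (the real logic lives in impute_latencies; mu, sigma and n_patients here are placeholder values):

import numpy as np

mu, sigma = 2.4, 0.2    # placeholders; really derived via lognormal_parameters(12, 5.433)
n_patients = 120        # placeholder for the number of rows in patients_data
# One fresh i.i.d. lognormal draw per patient on every loop iteration.
latencies = np.random.lognormal(mean=mu, sigma=sigma, size=n_patients)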

I don’t see how it can be a memory issue, since sometimes it fails on the very first loop and other times it gets many dozens of iterations in before breaking. I flush the entire kernel every time I restart, so it’s not influenced by previous runs.

Hmmm, the error message is pretty weird - maybe try setting cores=1.
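
In the loop above that would look something like:

with this_model:
    # cores=1 samples all chains in the main process, sidestepping the
    # Windows spawn/pickling path visible in the traceback above.
    this_trace = pm.sample(draws=n_samples, cores=1,
                           chains=n_chains, tune=n_tune)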


Running it with cores=1 now … I’ll be back probably tomorrow if/when it finishes.

Yeah it worked.


I also met these problems a couple of days ago when I was studying one of the book’s code examples:
https://github.com/PacktPublishing/Bayesian-Analysis-with-Python/blob/master/Chapter%204/04_Understanding_and_predicting_data_with_linear_regression_models%20(2).ipynb

In that code, the kernel always shuts down with a broken pipe error under PyMC3 3.5 once the models get complicated, for example the model in “Hierarchical linear regression”.

The book is written for PyMC3 3.0, and when I switch to version 3.0 it works well; I am really curious why. Right now I have to use 3.0 to avoid the broken pipe error, but there seem to be a lot of improvements in 3.5.
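
(For anyone else doing the same: the downgrade should be something like “conda install -c conda-forge pymc3=3.0”, assuming the original conda-forge install.)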

@aloctavodia is updating the book, I think - maybe you can raise an issue on the repository.

The code in this repository should be up to date: https://github.com/aloctavodia/BAP. Right now I am away from my computer, but next Monday I will be able to check this in more detail.