Segmentation fault (core dumped) on running pm.sample in Ubuntu

jithendaraa · July 26, 2020, 4:03pm

This was my implementation of a model in PyMC3. I was trying to find out how long the program takes to run for different amount of observations. When I give around 3000 or 5000 observations, the program runs fine. When I increase it to around 7000 or above, it runs fine till the pm.sample() line. Upon reaching the sampling statement, the program gives me a segmentation fault (core dumped) error. Is this normal/how do I fix this? Also, why does this happen?

with pm.Model() as hotDINA:    
    # Priors: theta, bk, ak, learn_k, ones, ss_k, g_k
    theta   = pm.Normal('theta', mu=0.0, sd=1.0, shape=(I, 1))
    lambda0 = pm.Normal('lambda0', mu=0.0, sd=1.0, shape=(K, 1))    #bk
    lambda1 = pm.Uniform('lambda1', 0.0, 2.5, shape=(K, 1))    #ak
    learn   = pm.Beta('learn', alpha=1, beta=1, shape=(K, 1))
    ones    = pm.Bernoulli('known', p=1, shape=(K, 1))
    ss      = pm.Uniform('ss', 0.5, 1.0, shape=(K, 1))
    g       = pm.Uniform('g', 0, 0.5, shape=(K, 1))
    for i in range(I):
        print("STUDENT", i+1, " out of", I)
        
        # t = 0
        for k in range(K):
            prob[i][0][k] = pm.math.invlogit((1.7) * lambda1[k,0] * (theta[i,0] - lambda0[k,0]))
            alpha_name = 'alpha[' + str(i) + ',0,' + str(k) + ']'
            alpha[i][0][k] = pm.Bernoulli(alpha_name, prob[i][0][k])
            
        for s in range(MAXSKILLS):
            idx = int(idxY[i][0][s] - 1)
            if idx >= K: continue
            py[i][0][idx] = pow(ss[idx,0], alpha[i][0][idx]) * pow(g[idx,0], (1-alpha[i][0][idx]))
        
        # t = 1,2...T[i]-1
        for t in tqdm(range(1, T[i])):
            for k in range(K):
                alpha[i][t][k] = pm.math.switch(alpha[i][t-1][k], ones[k,0], learn[k,0])
                
            for s in range(MAXSKILLS):
                idx = int(idxY[i][t][s] - 1)
                if idx >= K: continue
                py[i][t][idx] = pow(ss[idx,0], alpha[i][t][idx]) * pow(g[idx,0], (1-alpha[i][t][idx]))
            
        for t in tqdm(range(T[i])):
            for s in range(MAXSKILLS):
                idx = int(idxY[i][t][s] - 1)
                if idx >= K: continue
                obsData = pm.Minibatch(observed_data[i][idx][t], batch_size=batch_size)
                Y[i][t][idx] = pm.Bernoulli(f'y_{i}_{t}_{idx}', p=py[i][t][idx], observed=obsData)

    start = time.time()
    print(start)
    trace = pm.sample(2500, tune=2500)
    end = time.time()
    print("TIME: ", end - start)
    pm.save_trace(trace=trace, directory=".pymc_1.trace", overwrite=True)
    print("SAVED")
    summary_df = pm.stats.summary(trace)
    summary_df.to_excel("summary.xlsx")
    print("TOTAL END: ", time.time() - total_start)

Does this have more to do with theano rather than pymc3?
Versions:

PyMC3 Version: 3.9.2
Theano Version: 1.0.4
Python Version: 3.7.4
Operating system: Ubuntu 16
How did you install PyMC3: pip

Elenchus · July 27, 2020, 5:04am

I’m not overly familiar with either theano or pymc3, but when I’ve seen segfaults in python it’s been because I have too much in memory - maybe check your memory usage

jithendaraa · July 27, 2020, 11:42am

Yes, my RAM seems to be full. However, when I run this on the server by submitting a .job script, my .out file gives the final output for small amounts of data but doesnt give an output for even slightly larger data. It just keeps running forever (for a day, when I expect it to be completed in 2-3 hours max). But my .out file doesn’t contain a segfault error statement, is this normal if I’m using SLURM for Job scheduling? Anyway my guess is that it runs out of RAM too (when I request a single 128GB node). Going by your previous comment, does this mean it is supposed to work properly once I request more computational nodes (say 4 instead of 1)?

Elenchus · July 27, 2020, 12:20pm

Out of my depth there, sorry. It looks like SLURM should throw an error if you use too much memory - you could try querying the memory usage with sacct, as mentioned in that link, to see what’s going on.

nkaimcaudle · July 27, 2020, 1:15pm

Are you able to vectorise the computations rather than use python for loops? Also try without tqdm.

The start may look something like:

I, K = 16, 4
with pm.Model() as hotDINA:    
    # Priors: theta, bk, ak, learn_k, ones, ss_k, g_k
    theta   = pm.Normal('theta', mu=0.0, sd=1.0, shape=(I, 1))
    lambda0 = pm.Normal('lambda0', mu=0.0, sd=1.0, shape=(K, 1))    #bk
    lambda1 = pm.Uniform('lambda1', 0.0, 2.5, shape=(K, 1))    #ak
    learn   = pm.Beta('learn', alpha=1, beta=1, shape=(K, 1))
    ones    = pm.Bernoulli('known', p=1, shape=(K, 1))
    ss      = pm.Uniform('ss', 0.5, 1.0, shape=(K, 1))
    g       = pm.Uniform('g', 0, 0.5, shape=(K, 1))
    
    prob = pm.math.invlogit((1.7) * lambda1.T * (theta - lambda0.T))
    alpha = pm.Bernoulli('alpha', p=prob, shape=(I, K))

Topic		Replies	Views
Segmentation Fault with pm.MatrixNormal Questions	6	580	August 15, 2019
Pm.sample gets stuck after init with cores > 1 Questions	17	3944	January 4, 2021
Pymc3 getting stuck after initialization Questions	41	8897	February 9, 2022
Initial evaluation of model at starting point failed when running a mixture model Questions bug	2	1201	October 8, 2021
Fatal Python error: Segmentation fault Questions	4	1077	September 12, 2022

Segmentation fault (core dumped) on running pm.sample in Ubuntu

Related topics