How to do parallel processing for power analysis?

My goal is similar to the one in this thread: Bayesian sample size estimation for given HPD.

The approach is similar to the comment:

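import numpy as np
import pymc3 as pm
import theano

# define your model
# shared variables let n and X be updated without rebuilding the model
n = theano.shared(np.asarray(100, dtype=int))
X = theano.shared(np.asarray(50, dtype=float))
with pm.Model() as model:
    p = pm.Beta('p', alpha=2, beta=2)
    y_obs = pm.Binomial('y_obs', p=p, n=n, observed=X)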

sample_size = [10, 100, 1000, 10000]
# for loop
for s in sample_size:
    n.set_value(s)
    X.set_value(np.sum(new_observed_X))
    with model:
        trace = pm.sample()
    # compute HPD.
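
For reference, the `# compute HPD.` step can be sketched in plain NumPy. This is only a minimal illustration, assuming a unimodal posterior and a flat array of posterior draws; `hpd_interval` and `cred_mass` are names made up for this sketch, not part of PyMC3.

```python
import numpy as np


def hpd_interval(samples, cred_mass=0.95):
    """Shortest interval containing `cred_mass` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float).ravel())
    n = x.size
    if n == 0:
        raise ValueError("need at least one sample")
    # number of sorted samples each candidate interval must span
    k = int(np.ceil(cred_mass * n))
    k = min(max(k, 1), n)
    # widths of every window of k consecutive sorted samples
    widths = x[k - 1:] - x[:n - k + 1]
    lo = int(np.argmin(widths))
    return x[lo], x[lo + k - 1]
```

Inside the loop this could be called as `hpd_interval(trace['p'])`, and the interval width recorded for each sample size.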

Is there a recommended approach to parallelize the loop over increasing sample sizes? So far I have tried running it on an EC2 instance with 96 cores, but that didn't offer any noticeable speedup. I have also tried using Python's concurrent.futures library to parallelize the loop, but that didn't offer any speedup either. The only other idea I have is to run each model on a separate instance (e.g. with Kubernetes).

Any thoughts are greatly appreciated!

Does multiprocessing not work? Something like:

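from multiprocessing import Pool

import pymc3 as pm

def do_power(n):
    # gen_observed is a placeholder for your own data simulator
    new_observed_X = gen_observed(n)
    with pm.Model() as mod:
        p = pm.Beta('p', alpha=2, beta=2)
        y_obs = pm.Binomial('y_obs', p=p, n=n, observed=new_observed_X)
        # cores=1 so each worker process runs its chains serially
        tr = pm.sample(chains=4, cores=1)
    return tr

if __name__ == '__main__':
    with Pool(6) as pool:  # or whatever
        trace_list = pool.map(do_power, [10, 100, 1000, 10000])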

@chartl I tried using concurrent.futures multiprocessing, as well as an approach similar to yours. I’m getting this output:

INFO (theano.gof.compilelock): Waiting for existing lock by process '6288' (I am process '6289')
INFO (theano.gof.compilelock): To manually release the lock, delete /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.7.1-64/lock_dir

It seems like it is attempting to run in parallel, but Theano's compile lock is preventing it: each process waits for the lock on the shared compile directory. The overall run time is the same as doing a for loop.

Is there another approach that you know of? The full run of the power analysis takes about 24 hours.
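
One workaround that is sometimes suggested for the compile-lock messages above is to give every worker process its own Theano compile directory through the `base_compiledir` flag in `THEANO_FLAGS`, so the workers no longer contend for the same `lock_dir`. The following is a rough sketch under several assumptions, not a tested fix for this setup: the helper names (`theano_flags_for`, `init_worker`, `run_all`) are invented here, the simulated data stands in for `gen_observed`, and the parent process must not have imported theano or pymc3 before the pool starts, because Theano reads the flags at import time.

```python
import os
import tempfile
from multiprocessing import Pool


def theano_flags_for(pid, root=None):
    """Build a THEANO_FLAGS value pointing at a per-process compile directory."""
    root = root or tempfile.gettempdir()
    return "base_compiledir=" + os.path.join(root, "theano_%d" % pid)


def init_worker():
    # Runs once in each worker, before theano/pymc3 is imported there.
    # Note: this overwrites any THEANO_FLAGS already set in the environment.
    os.environ["THEANO_FLAGS"] = theano_flags_for(os.getpid())


def do_power(n):
    # Imported lazily so the per-process flags are already in place.
    import numpy as np
    import pymc3 as pm

    # Hypothetical stand-in for gen_observed(n): simulate one binomial count.
    x = np.random.binomial(n, 0.5)
    with pm.Model():
        p = pm.Beta("p", alpha=2, beta=2)
        pm.Binomial("y_obs", p=p, n=n, observed=x)
        trace = pm.sample(chains=4, cores=1, progressbar=False)
    # Return only the draws of p, which are cheap to send back to the parent.
    return trace["p"]


def run_all(sample_sizes, processes=4):
    # One worker process per task slot, each with its own compile directory.
    with Pool(processes, initializer=init_worker) as pool:
        return pool.map(do_power, sample_sizes)
```

This would be called as `run_all([10, 100, 1000, 10000])` under an `if __name__ == '__main__':` guard. The trade-off is that each worker starts with a cold compile cache, so every process pays the compilation cost once; whether this beats the plain for loop depends on how much of the 24 hours is sampling rather than compilation.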