Parallelizing chains with custom likelihood on multiple cores

Thank you @ricardoV94 for the important point. I edited my response below to reflect this, and now it is even more evident that parallelization does not seem to work for custom-likelihood models (unless I’m doing something else wrong) :sweat_smile:

I really suggest you try to break your problem to the simplest case:

  1. Don’t use DEMetropolis for debugging. Use something more standard like Slice or Metropolis-Hastings, with default tuning settings (see the sketch after this list).
  2. Use a realistic example where your likelihood has a meaning. Returning -1 all the time is not good.
  3. Related to both points, make sure your model actually converges; otherwise you may just be benchmarking edge cases like maximum iteration bounds.
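
As an illustration of the kind of simple, convergent setup I mean, here is a minimal sketch (the toy data and parameters are made up for the example, not taken from your model):

import numpy as np
import pymc as pm

# toy problem: infer the mean of normally distributed data
rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=0.5, size=200)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    pm.Normal("obs", mu=mu, sigma=0.5, observed=data)
    # a standard step method with default tuning settings
    idata = pm.sample(step=pm.Metropolis(), chains=4, cores=4)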

@aseyboldt showed an example where multiprocessing seems to work as expected. I am confident that same example with a custom Op would show benefits of parallelization.

You can try to start from there and slowly modify into your use case to see what actually “breaks” performance. Once you identify the culprit it will be easier to find a workaround.

Also, numpy routines can use multithreading by default (just like PyTensor does), meaning that comparing 1 vs. many chains may not be fair. You can use pure Python operations or play with environment variables to make sure each process uses a single thread:
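
For example, something along these lines (a sketch; the limits should be set before numpy/pytensor are imported so they actually take effect):

import os

# force single-threaded BLAS/OpenMP in every process
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # only import after the limits are set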

Those are good points, thank you for bringing them up!

  1. The reason I used DEMetropolis is because that is what I am using for inference in my “real” model. I tried with Metropolis below.
  2. I tried to write an example where the likelihood should be informative and the model should converge, at least in theory. In the proposed model below the solution should be the one with minimal L2 norm, i.e. a vector of zeros. The Metropolis sampler is getting there, but as expected it would take impractically long to converge!
  3. The model does not converge within the proposed number of draws, but it should converge eventually. If you think it makes a difference I can run it for longer on a server.

I’m not sure how to build an intermediate step between the pure PyMC example given by @aseyboldt and the custom-likelihood case. What I attach here is the simplest custom-likelihood example I can imagine. I also followed your advice and turned off multithreading.

Finally, apologies if this comes across as spam. I mainly run PyMC with custom likelihoods because I am calling external models, which is why I focus on that case here. And in my real example, where the forward operation takes about a second (and inference takes a few weeks), the parallelization really makes a difference :slight_smile: Thanks again for the input!

import os
import time

# NOTE: set thread limits before importing numpy/pytensor so they actually take effect
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

import numpy as np
import pymc as pm
import pytensor.tensor as pt

class LogLike(pt.Op):
    # old-style Op: fixed input/output types declared via itypes/otypes
    itypes = [pt.dvector]  # one vector of parameter values
    otypes = [pt.dscalar]  # a single scalar: the log-likelihood

    def __init__(self, custom_function):
        self.likelihood = custom_function

    def perform(self, node, inputs, outputs):
        logl = self.likelihood(inputs[0])
        outputs[0][0] = np.array(logl)  # output the log-likelihood

def log_likelihood(A):
    np.linalg.svd(np.outer(A, A))  # dummy SVD just to slow down the forward evaluation; the result is not used
    return -np.linalg.norm(A)


draws = 500
variables = 100
times = []
for cores in [3, 6, 12]:
    start = time.time()
    var_names = ['var_' + str(num) for num in range(variables)]
    with pm.Model(coords={"A": var_names}):
        A = pm.Uniform("varA", upper=1, lower=-1, dims='A')
        A = pt.as_tensor_variable(A)
        log_l = LogLike(log_likelihood)
        pm.Deterministic("likelihood_det", log_l(A))
        pm.Potential("likelihood", log_l(A))
        idata = pm.sample(draws=draws,
                          step=pm.Metropolis(),
                          cores=cores,
                          chains=cores,
                          tune=draws,
                          progressbar=False,
                          discard_tuned_samples=False,
                          compute_convergence_checks=False)
    end = time.time()
    times.append(end - start)
print(times)

I think the problem is the deterministic “likelihood_det”. It is not computed on the worker processes, but in the main process. This typically isn’t much of an issue with HMC-based samplers, because each draw takes much longer than a Metropolis draw.
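
A minimal sketch of the same model block with the pm.Deterministic dropped, so the custom Op is only evaluated through the pm.Potential inside each chain’s worker process (reusing the names from your example):

with pm.Model(coords={"A": var_names}):
    A = pm.Uniform("varA", upper=1, lower=-1, dims="A")
    log_l = LogLike(log_likelihood)
    # no pm.Deterministic here: the custom Op only enters the model logp
    pm.Potential("likelihood", log_l(pt.as_tensor_variable(A)))
    idata = pm.sample(draws=draws, step=pm.Metropolis(),
                      cores=cores, chains=cores, tune=draws,
                      progressbar=False, compute_convergence_checks=False)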


I tried exactly the same run as above but with the Deterministic disabled, and the inference times were still similar:

[169.3852207660675, 218.95040702819824, 387.6696081161499]

seconds for 3, 6 and 12 cores.

The increase in computation time is not proportional to the number of cores (and chains), so some parallelization is probably taking place. But the gain for the custom likelihood with the Metropolis or DEMetropolis samplers seems marginal. If there are no other ideas to test, we can close the post and mark the first answer from @aseyboldt as the solution. It does not address the custom likelihood, but it is the most informative answer in the thread.

Hm, that’s not what I expected…

If I run the code you posted, just commenting out the pm.Deterministic, I get

[223.85601568222046, 232.24450850486755, 258.4911205768585]

And the expected number of cores is active as well.
What machine are you running this on? Does it have 12 physical cores (as opposed to hardware threads)? Could it also be that it clocks down with a lot of active cores? I guess it could also be a different version of pymc or pytensor, but I don’t really see where that would be coming from… (I’m pretty close to main for both of those, but I’m not aware of recent changes that would do something like this).
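
One quick way to check the physical core count (a small sketch, assuming psutil is installed):

import os
import psutil

print(os.cpu_count())                   # logical CPUs (hardware threads)
print(psutil.cpu_count(logical=False))  # physical cores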

Strange! I am running this on a Lenovo ThinkPad T14 Gen 2a; I see 16 CPUs in the system monitor and 32 GB of RAM. I’m on Ubuntu 22.04, with Python 3.11, pymc 5.0.1, pytensor 2.8.11, and numpy 1.23.0. When I run the script it looks like my cores are up and running, but the times I get are consistently similar to what I reported above. I get the same results on a server also running Ubuntu. In any case, if it works for you then it’s probably something wrong on my side. Really appreciate your time and effort in explaining everything and troubleshooting along with me :pray:

That sounds like your machine has 8 cores, and each of those cores has two hardware threads. Most system monitor displays don’t show that, and it just looks like there are 16 cores then. (You wouldn’t be the first to be surprised by that :slight_smile: ).
Most of the time only the number of true cores really matters for numerics.
You are welcome, glad we could help at least somewhat :slight_smile:


That would make sense! This is what throws me off then…