Only one chain running, three are stuck

Hi all,

This problem is a bit of a strange one: when running my current model with pm.sample(), only one of the four chains is exploring the posterior, while the other three are stuck at their initial values and don’t change.
About the model:
I am trying to infer parameters from a multidimensional dataset (4x2500) using a model I have written in PyTensor. A custom PyTensor function (Op) is part of the model. There are multiple combinations of parameters that could explain the data, so the posterior may be rather flat in some regions.

Everything runs fine if I set chains=1.
Is there anything that could lead to this effect of only one chain running? Maybe I need to account for the parallel chains in the custom PyTensor function?

Any help or ideas are highly appreciated…

Update: This issue disappears when I use pm.Metropolis(). Therefore I think it is related to one of two things:
a) the ‘flatness’ of the posterior, which somehow makes the NUTS sampler struggle,
or
b) an error in the grad() function inside my custom PyTensor function, since the Metropolis sampler doesn’t use it.

For now I’ll stick with Metropolis, but if you have any other suggestions, I’d be happy to hear them.
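For reference, this is roughly what the fallback looks like (just a sketch; the priors here are placeholders for the real model):

import pymc as pm

with pm.Model() as model:
    # placeholder priors; the real model uses the custom RootFinder Op below
    D = pm.HalfNormal("D", sigma=1.0)
    a = pm.HalfNormal("a", sigma=1.0)
    b = pm.HalfNormal("b", sigma=1.0)
    # ... likelihood built from x_list = rootfinder(D, a, b) ...

    # gradient-free step method instead of the default NUTS
    idata = pm.sample(step=pm.Metropolis(), chains=4)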

Is this issue unique to this model? Do other models sample well with multiple chains?

Yes, it seems to be unique to this model with the custom PyTensor function.
Other models run fine with multiple chains.

Can you expand on that? Maybe share some code?

Yes, this function is the result of another Discourse post:

import pytensor.tensor as at
from pytensor.graph.basic import Apply
from pytensor.graph.op import Op


class RootFinder(Op):

    def __init__(self, t):
        self.t = t

    def make_node(self, D, a, b):
        D, a, b = map(at.as_tensor_variable, (D, a, b))
        outputs = [at.vector(dtype='float64')]
        return Apply(self, [D, a, b], outputs)

    def perform(self, node, inputs, outputs_storage):
        D, a, b = inputs
        # numerically compute the first n roots for these parameter values
        outputs_storage[0][0] = root_finder_loop_perform(D, a, b, self.t)

    def grad(self, inputs, output_gradients):
        D, a, b = inputs
        x_list = self(D, a, b)

        # implicit differentiation: partial derivatives of the root condition
        x_grad_list = a*D/((a**2*b)/4 - D**2*x_list**2) + 2*a*D**3*x_list**2/((a**2*b)/4 - D**2*x_list**2)**2
        D_grad_list = 4*a*(a**2*b*x_list + 4*D**2*x_list**3)/(a**2*b - 4*D**2*x_list**2)**2
        a_grad_list = -4*D*x_list*(a**2*b + 4*D**2*x_list**2)/(a**2*b - 4*D**2*x_list**2)**2
        b_grad_list = -4*a**3*D*x_list/(a**2*b - 4*D**2*x_list**2)**2

        # dx/dparam = -(dF/dparam)/(dF/dx), contracted with the incoming output gradient
        grad_D = at.dot(-D_grad_list/x_grad_list, output_gradients[0])
        grad_a = at.dot(-a_grad_list/x_grad_list, output_gradients[0])
        grad_b = at.dot(-b_grad_list/x_grad_list, output_gradients[0])

        return [grad_D, grad_a, grad_b]

where root_finder_loop_perform() is a function that returns the first n roots of a function, and D, a, and b are random scalar variables. The grad() function is expected to return three zero-dimensional outputs. I have confirmed that it works with verify_grad(), and it runs smoothly with a single chain.
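In case it helps, this is roughly how I checked the gradient (a sketch; the test values for D, a and b are made up, and t is whatever the Op was built with):

import numpy as np
from pytensor.gradient import verify_grad

# made-up test point; the real values come from the model's priors
test_point = [np.array(1.0), np.array(2.0), np.array(0.5)]

verify_grad(RootFinder(t=t), test_point, rng=np.random.RandomState(42))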

I then call the function via:

rootfinder = RootFinder(t=t)
x_list = rootfinder(D, a, b)

and use the n-dimensional x_list values in the next steps.
As I said previously, this works fine for a single chain, but doesn’t seem to like multiple chains.
Any help is highly appreciated!

What gets called in root_finder_loop_perform, and does it use multiple cores? There might be some multiprocessing scheduling issues resulting from each PyMC worker spawning lots of workers that don’t know about each other.
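One way to test that hypothesis (just a sketch, independent of the shgo settings) would be to keep four chains but run them sequentially, so only one PyMC worker ever spawns subprocesses; model here is your existing model context:

with model:
    # chains still 4, but run one at a time,
    # so nested multiprocessing pools can't collide
    idata = pm.sample(chains=4, cores=1)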

It calls scipy.optimize.shgo() to find the n roots of a function. Is that known to use multiple cores?

Very possibly?


Does the problem persist when you set workers=1 on shgo? Might be barking up the wrong tree entirely, but that would be the test.
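Something along these lines inside root_finder_loop_perform, just to show where the keyword goes (the objective and bounds here are placeholders, and workers on shgo needs a fairly recent SciPy):

from scipy.optimize import shgo

# placeholder objective and bounds; the real root search lives in root_finder_loop_perform
def objective(x):
    return (x[0] - 1.0) ** 2

result = shgo(objective, bounds=[(-5.0, 5.0)], workers=1)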


Yes, I just tried that and it doesn’t solve the issue. But maybe the shgo algorithm is less effective with workers=1 and hence leads to more problems. Maybe using pm.Metropolis() is the best way to go in this case…