Custom PyTensor Op for Root-Finding with Multiple Inputs and Outputs

Hi all,

I am relatively new to PyMC and am still trying to understand, how everything works. For my current project I need to estimate a diffusion equation numerically as desribed here.
After trying different things in PyTensor and Aesara, I realized that I would need to write my own Op to run a root-finding algorithm, which can find the first 4 roots of the equation (23) in the cited work.

My implementation is:

import pymc as pm
import pytensor.tensor as at
from pytensor import *
from pytensor.graph.op import Op
from pytensor.graph.basic import Apply
from scipy.optimize import fsolve

def root_finder_loop_perform(D, Sf, Sb, thickness):
    # Define an Aesara expression for the function to be rooted
    def f(x, D, Sf, Sb, thickness):
        return np.tan(x*thickness*1e-7)+(D*(Sb + Sf)*x)/(Sf*Sb-D**2*x**2)
        
       # Set up an array to store the roots for each value of a
    root_list = np.zeros((4))
        
    # Loop over the values of a and find the roots
    # Set the initial guess for the root
    x = 100.
    b = 0
    # Loop over the first four roots
    while b < len(root_list):
        # Find the root using scipy.optimize.root
        sol = fsolve(f, x, args=(D,Sf,Sb, thickness))#, fprime=jac)
        
        # Check if the root is unique and add it to the root list if it is
        if np.round(sol,0) not in np.round(root_list,0) and np.round(sol,0) > 0:
            root_list[b] = sol
            b += 1
        else:
            x = x*1.1

    return np.array(root_list)


class RootFinder(Op):
    
    def __init__(self, thickness):
        self.thickness = thickness
        
    
    def make_node(self, D, Sf, Sb):
        outputs = [at.vector(dtype='float64')]
        return Apply(self, [D, Sf, Sb], outputs)
    
    
    def perform(self, node, inputs, outputs_storage):
        D, Sf, Sb = inputs
        
        root_list = root_finder_loop_perform(D, Sf, Sb, self.thickness)
        outputs_storage[0][0] = root_list
        
    
    def grad(self, inputs, grads):
        D, Sf, Sb = inputs

        x_grad = self(D, Sf, Sb)
    
        D_grad_list = (Sf+Sb)*x_grad*(Sf*Sb*x_grad-D**2*x_grad**3)/(Sf*Sb-D**2*x_grad**2)**2
        Sf_grad_list = -D*x_grad*(Sb**2+D**2 * x_grad**2)/(Sf*Sb-D**2*x_grad**2)**2
        Sb_grad_list = -D*x_grad*(Sf**2+D**2 * x_grad**2)/(Sf*Sb-D**2*x_grad**2)**2
        
        # Compute the gradients with respect to D, Sf, and Sb
        return at.dot(D_grad_list,grads[0]), at.dot(Sf_grad_list,grads[0]), at.dot(Sb_grad_list,grads[0])

Where D, Sb and Sf are random variables and thickness is a constant.

In the main part of the code I then use:

rootfinder = RootFinder(thickness=thickness)
beta = rootfinder(D, Sf, Sb)

While this does run with step = pm.NUTS, it is very slow (several s/it) and I am worried that there is an issue with the implementation of the grad() method. I also don’t fully understand which gradients are stored in grads[0]. Any help would be appreciated!
Thank you already…

Tentative CC @aseyboldt

I am worried that there is an issue with the implementation of the grad() method.

While not a complete answer, I believe you can check that the gradients are correctly implemented with pytensor.gradient.verify_grad (this compares your analytic gradients to those estimated via finite-difference)

Also a minor point, scipy classifies fsolve as an unmaintained legacy function, root is preferred (and equivalent when method='hybr').

You could also consider using a global solver (optimize.shgo), instead of looking for roots one-by-one. With sampling_method='sobol', the shgo optimizer returns all local minimum candidates (which you could then check/rank if you only wanted 4, but in my testing I found there were many more roots than 4 on the interval [-pi/2, pi/2]).

I wanted to give an example of using shgo, but I didn’t fully grasp your problem setup (like the expected range of the parameters, or why you seek only 4 roots).

3 Likes

In addition to what @jessegrabowski said:

  • Control of which zeros you actually compute sounds important to me. If the solver switches between different zeros during sampling, or even just changes the order of the zeros, that might break the gradients and make the logp function discontinuous, which breaks all manner of assumptions. I think optimize.shgo should be able to do that.
  • I didn’t go through it all the way, so I’m not 100% sure, but the gradients don’t look right on first glance at least. What should be in there is something along the lines of this: For each of the three inputs, you use the the implicit function theorem to compute the Jacobi matrix (or because here it is just a 1d function, the derivative) of the function that maps the parameters to the zeros. So let’s call the function that we compute the roots of f, and the the function that maps a parameter to the root of f g. So we have `f(g(\alpha), \alpha) = 0 and therefore \frac{\partial}{\partial \alpha} f(g(\alpha), \alpha) = 0 and \frac{\partial g}{\partial \alpha} = - \left(\frac{\partial f}{\partial x}\right)^{-1}\frac{\partial f}{\partial \alpha}.

I spent some more time thinking about the problem, and I think there’s also a lot of structure here that isn’t being exploited. The function is almost tan(x), with roots every \pi, but slightly offset by the additive adjustment term, which i’ll name h(x) = \frac{Dx(S_b + S_f)}{S_fS_b-D^2x^2}. The first two roots are trivial: 0 and the vertical asymptote of h(x) (the latter is also scaled by the thickness term, but a closed form should be 100% available)

In the limit, the roots occur exactly every \frac{\pi}{t}, since \lim_{x \to \infty} h(x) = 0.

All this has me wondering if a closed form for all the roots might be available by just thinking about about the offset term from the roots of tan(xt). If you just evaluate \frac{k\pi}{t} - h(\frac{k\pi}{t}) (where k is the k-th root of tan(xt), you already get an extremely close approximation to the roots you find numerically:

   From Optimizer  From Approximation
2        6.584620            6.609768
3        9.631685            9.639401
4       12.723241           12.726540
5       15.834105           15.835805

That might already be close enough (I don’t know your application), but if not it certainly suggests a solution approach that disposes of the custom Op entirely, which should be preferred.

@aseyboldt I thought maybe he was using the envelope theorem to get the derivative of parameters at the optimum, but I guess a root isn’t the same thing as an optimum. If he squared his objective function though, it would convert the roots into minima, and then envelope would apply? So given g(x) = f(x)^2, and x^\star defined such that x^\star = arg min_x g(x) and g^\star(x) = g(x^\star), then \frac{\partial g^\star(x)}{\partial a} = \frac{\partial g(x)}{\partial a} \Big |_{x^\star} ?

I’m asking as much for me here, because my eyes always glaze over when I read “envelope theorem” in a paper.

2 Likes

First of all, thank you all for replying so quickly!

  • I ran python pytensor.gradient.verify_grad and as all of you expected, the gradients I implemented were wrong. I got this message:
GradientError: GradientError: numeric gradient and analytic gradient exceed tolerance:
        At position 0 of argument 0 with shape (),
            val1 = -159.518067      ,  val2 = 0.000000
            abs. error = 159.518067,  abs. tolerance = 0.000100
            rel. error = 1.000000,  rel. tolerance = 0.000100
Exception args: 
  • Implementing python optimize.shgo worked very well together with python numpy.sort when using the squared of my objective function. The reason why I couldn’t use the \frac{\pi k}{t} - h(x) estimation is that D, Sf and Sb all span multiple orders of magnitude and some combinations yield roots quite far away from this simple estimation.

  • To the last comment of @jessegrabowski: Using the squared of my objective function will yield x^* = argmin_xg(x), which means that \frac{\partial g^\star(x)}{\partial x} = 0. Does this mean the input python grads in the grad() method is also zero?
    I understand that does not mean that \frac{\partial g(x)}{\partial a} \Big |_{x^\star} = 0, so how can I estimate a single gradient for each input, if they generate multiple roots at which I can assess the partial derivative?

@jessegrabowski The approximation is smart idea! But I guess as @M_K-C it probably depends a lot on the parameter values? Maybe there are ways to rescue something like this, by looking at different cases, and switching to a different approximation depending on parameter values, possibly followed by some newton iterations? That would probably be faster than a numerical root finder, but I don’t know if it is worth the effort, the numerical approach should work just fine I guess… :slight_smile:

I don’t really see how the envelope theorem helps here, but I might totally be missing something.
The way I learned that, it is a statement about the derivative of the value of the function at the optimum, ie \frac{d}{d \alpha} g(x^\star(\alpha), \alpha) = \frac{\partial g(x, \alpha)}{\partial x} \frac{\partial x^{\star}}{\partial x} + \frac{\partial g(x, \alpha)}{\partial \alpha} = \frac{\partial g(x, \alpha)}{\partial \alpha} = 0, equal zero because all our optima are at 0.

The grad function is used to do back-prop, so it’s accumulating chain-rule terms as it moves backwards through the computational graph. The input grads is the accumulated gradient so far, and when you return the multiplication g * grad[0], you’re adding the node’s contribution to that accumulated chain rule. It might be zero, but it also might not be.

I’m not suggesting that you actually use the simple estimation, but I am suggesting that you think carefully about the structure of the problem. I still have no idea about the domains of the parameters, but if D is strictly positive, it is close to irrelevant to the position of the roots (well, sort of – for D >> 1, the approximation is basically always good). The relative sizes of S_f and S_b define a vertical asymptote that divides R^+ into two subsets which should yield (i believe) two root formula. You will always know the position of the asymptote given the parameters. Here’s a Desmos implemention of your objective function that you can play with to convince yourself that this “two regime” model makes sense.

My point really is that when faced with a question where you have to write a new Op to bring in an optimizer, the answer is “don’t” unless you’ve really exhausted every analytical avenue, which I don’t believe you have.

@aseyboldt I don’t think you’ve misstated it at all, but why must it be equal to zero? Given the assumptions, only the first term of the first term (which should be x^\star in the denominator) has to be zero, not the whole expression?

(My reference for this is Walde 2009, page 47)

Wait, why didn’t I know about desmos? That looks like a fun tool :slight_smile:

Given the assumptions, only the first term of the first term (which should be x⋆ in the denominator) has to be zero, not the whole expression?

The first term is zero because at an optimum the gradient of the objective function has to be zero.
But we also know that by construction the optimum at those optima that came from a zero of the original function f is always zero, ie \phi(\alpha) := g(x^\star(\alpha), \alpha) = 0 for all \alpha in a neighborhood of \alpha^\star, and therefore the derivative is also zero.

Desmos is great! I like to use it for this kind of intuition building. Great as a teaching tool too.

Yes you’re 100% right about the function, my mistake; I didn’t account for the fact that the minimum of the “loss” is a root of the original function. But the implicit function method you mention will still work fine!

Hi @jessegrabowski, I agree that it would be much better to have an analytical version to estimate the roots and not bother with a custom Op at all.
The parameters are typically: thickness ~ 10^{-5}, D between 10^{-4} and 1, Sb and Sf between 1 and 10^{4}. This makes the ‘two-regime’ estimation a bit more complicated, because the vertical asymptote that defines the two can now move with D as well (as your nice Desmos implementation also shows).

So instead, I updated the partial derivatives to match with the ‘new’ squared version of my function (something, which I should have done before I guess) and now pytensor.gradient.verify_grad runs, if the absolute error is smaller than 10^{-10}.
The PyTensor documentation of the grad() method states that the output must be \frac{\partial\alpha}{\partial x} = \frac{\partial\alpha}{\partial g(x)} * \frac{\partial g(x)}{\partial x}, but as we are now at a minimum, \frac{\partial g(x)}{\partial x} should be 0, so the matrix product should also be 0. This would be in line with the val2 of pytensor.gradient.verify_grad. Or am I getting something completely wrong here?

I’m not entirely sure what you current implementation looks like right now.

Maybe it helps to formalize this a bit more:

We have a parameter vector \alpha\in\mathbb{R}^3.
We have a function f_\alpha: \mathbb{R}\to \mathbb{R}, the function you want to compute roots of.
Then, we have a function g_\alpha: \mathbb{R} \to \mathbb{R}, x\mapsto f_\alpha(x)^2.

We have the function r(\alpha):\mathbb{R}^3\to \mathbb{R^5} that returns the first 5 roots of f_\alpha.

And lastly, we have a function h: \mathbb{R}^5 \to \mathbb{R}, that maps your roots to a scalar. You don’t really know what that function is, pymc will build it based on how you use your RootFinder Op in the model.

You are building an Op for the r function, so in perform, you want to somehow find the roots. pymc/pytensor do not care at all how you do this, and you might do that by finding optima of g numerically. Keep in mind though that all roots of f correspond to an optimum for g, but some optima of g might or might not correspond to roots of f.

For the sampler to work, pytensor will then need to compute \frac{d}{d \alpha} h(r(\alpha)) = J_h(r(\alpha)) J_r(\alpha). This is what happens in the grad function. The grads input is J_h(r(\alpha)), which in this case you can think of as a row vector (pytensor doesn’t distinguish between row and column vectors, so this is just a 1d vector). And the output of the grad function should be the above vector-matrix product, ie a 1d vector \in \mathbb{R}^3 again. (As a sidenote: we could also require the code to return the jacobi matrix J_r(\alpha) itself, instead of a vector matrix product, but for most Ops it is much faster to compute that product without ever storing the whole matrix in memory, which might be gigantic. That whole idea for computing gradients is called backward mode autodiff).

And you can compute the jacobi matrix `J_r(\alpha)$ using the implicit function theorem:

J_r(\alpha)_{ij} = \frac{\partial r_i}{\partial \alpha_j} = - \left.\left(\frac{\partial f(x)}{\partial \alpha_j}\right)^{-1}\right|_{x=r(\alpha)_i} \left.\frac{\partial f(x)}{\partial \alpha_j}\right|_{x=r(\alpha)_i}
3 Likes