Something changed in `pytensor > 2.12.3` (and thus `pymc > 5.6.1`) that makes my `pytensor.gradient.grad` call get stuck - any ideas?

This is a really tricky one to narrow down, and I’ve tried to simplify it as far as I can. I think I need a bit of insight from the pytensor devs to figure out where to look further for what’s changed recently.

I’ve written a copula model (loosely based on, and improving upon, a working notebook by @junpenglao) that relies on transforming observed data. It evidences Potentials on the marginals and the copula, and critically also applies a Jacobian adjustment to the copula.

I’ve simplified that model a little here, but I can’t reproduce the issue when I simplify any further, so here we are.

Everything works great in pymc 5.6.1, pytensor 2.12.3, but in pymc 5.9.0, pytensor 2.17.2 there seems to be a hidden (to me) issue with computing the gradient. It appears only when I include my function to compute the Jacobian adjustment:

import pytensor.gradient as tg
import pytensor.tensor as pt

def log_jcd(f_inv_x: pt.TensorVariable, x: pt.TensorVariable) -> pt.TensorVariable:
    """Calc log of Jacobian determinant of f_inv_x w.r.t. x"""
    jcd = tg.grad(cost=pt.sum(f_inv_x), wrt=[x])  # grad returns a list when wrt is a list
    return pt.log(pt.abs(jcd))
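
For context on how this is wired in: the gists below attach it as a Potential on a transformed view of the observed data, roughly like this (a toy sketch with a stand-in transform and made-up data, not the actual copula model):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

y_obs = np.random.default_rng(42).lognormal(size=(10, 2))

with pm.Model() as toy:
    y = pm.MutableData('y', y_obs)                # observed data, shape (n, 2)
    y_c = pm.Deterministic('y_c', pt.log(y))      # stand-in for the copula transform
    _ = pm.Potential('pot_jcd', log_jcd(y_c, y))  # add the Jacobian adjustment to the logp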

I’ve written a few related scenarios to (hopefully) illustrate the issue:

Scenario 1 (Old World): Use pymc 5.6.1, pytensor 2.12.3, DO apply Jacobian adjustment - everything runs great as expected

see gist 994 here

Scenario 2 (New World): Use pymc 5.9.0, pytensor 2.17.2, do NOT apply Jacobian adjustment - everything runs as expected, although obviously with wider variance due to not using the Jacobian adjustment. I just want to prove that everything else runs here.

see gist 993x here

Scenario 3 (New World): Use pymc 5.9.0, pytensor 2.17.2, DO apply Jacobian adjustment - everything DOES NOT run as expected

see gist 994x here

Here in Scenario 3, two calls fail, which I hope will be illustrative enough for the pytensor devs to help me figure this out:

  1. The call to pm.init_nuts hangs forever, seemingly disappearing into an infinite loop. In the old world this call completes within seconds, but in the new world I’ve left it for over 50 minutes without completion. When I interrupt the process I see this stacktrace init_nuts

  2. I’ve also tried a call to pm.find_MAP in the hope that this might be simpler and at least let me make a start. However, this also hangs in a never-ending loop. When I interrupt that process I see this stack trace find_map

I’m at a bit of a loss where to investigate further. I don’t think it’s the famous shape issue, and I should add that I don’t see this issue in a simpler model discussed elsewhere, which also uses the new-world pymc 5.9.0, pytensor 2.17.2 and DOES apply a Jacobian adjustment, BUT that adjustment is NOT applied to an ObservedRV, rather it IS applied to a FreeRV.

The combination of applying a grad to observed data in pymc 5.9.0, pytensor 2.17.2 seems to cause the issue…

I’d greatly appreciate any and all ideas to help me debug this further!

Try setting pytensor.config.optimizer_verbose=True to see if you are in an infinite graph rewriting loop
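
e.g. switch it on before compiling / sampling the model:

import pytensor

# prints each graph rewrite as it is applied; an infinite rewrite loop
# shows up as the same rewrites repeating endlessly in the output
pytensor.config.optimizer_verbose = True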


Unhelpful to your problem and maybe a dumb question, but where’s the determinant in the log_jcd function?

Probably an abuse of notation that we also do internally, but he is referring to the change-of-variables adjustment, which for 1D is just the absolute value of the derivative
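
i.e. for a 1D transform where the observed $y$ is mapped back through an inverse $f^{-1}$, the adjustment being referred to is

$$\log p_Y(y) = \log p_X\!\left(f^{-1}(y)\right) + \log\left|\frac{d}{dy} f^{-1}(y)\right|$$

which is the log-absolute-derivative term that the log_jcd function above returns.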


Haha yes, the same terminology is used here https://github.com/pymc-devs/pymc/blob/827918b42720e4108cd86af88a56229f9af85fcf/pymc/pytensorf.py#L505

I’ll try this (pytensor.config.optimizer_verbose=True) this morning - thanks for that

Also, while I have you @ricardoV94 - would (should?) the adjustment look different for a 2D RV? I’ve wondered if that could also be a source of my woes. In an earlier version of this model I applied the log_jcd to each marginal 1D distribution, but in testing (using old-world pytensor <= 2.12.3) it seemed to yield the same results as the 2D version, and I think it should be the same if we’re just adding logged values to the likelihood. Still, it does seem a little odd to me to sum the 2D transformed RV over all dimensions to get a scalar that’s then evaluated w.r.t. the 2D untransformed one.

i.e.

# f_inv_x  has shape (n, 2)
# x has shape (n, 2)
# but then per the formulation for cost=pt.sum(f_inv_x) this becomes scalar

jcd = tg.grad(cost=pt.sum(f_inv_x), wrt=[x])
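
As a toy sanity check of that sum-then-grad pattern (an elementwise stand-in transform here, not the actual copula transform), the grad of the summed output does hand back the per-element derivatives:

import numpy as np
import pytensor
import pytensor.gradient as tg
import pytensor.tensor as pt

x = pt.matrix('x')
f = pt.exp(x)                       # stand-in elementwise transform, df/dx = exp(x)
g = tg.grad(cost=pt.sum(f), wrt=x)  # same shape as x

fn = pytensor.function([x], g)
x_val = np.array([[0.1, 0.2], [1.0, 2.0]])
print(np.allclose(fn(x_val), np.exp(x_val)))  # True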

I think for a multivariate transformation (not just elemwise) you should compute the actual jacobian determinant.


I was worried you might say that :sweat_smile:

Wading out of my depth and looking around for some code to use or copy:

A. I see that pymc has a log_jac_det for TransformedRVs which only supports 0D or 1D, https://github.com/pymc-devs/pymc/blob/827918b42720e4108cd86af88a56229f9af85fcf/pymc/logprob/transforms.py#L185

B. In fact pymc has a handful of very similar looking functions related to Jacobians - none of which seem to deal with more than 1D https://github.com/pymc-devs/pymc/blob/827918b42720e4108cd86af88a56229f9af85fcf/pymc/pytensorf.py#L505

C. I see that pytensor has a jacobian for 1D variables… so perhaps I can bash this to my will…

I hope that the following is reasonable, but I’m more than open to being corrected! This does run in the old-world pytensor (Notebook 994 above), but it doesn’t provide the nice tight variances in m_s that log_jcd did, so I suspect I’m doing something wrong

import numpy as np
import pytensor.gradient as tg
import pytensor.tensor as pt

def log_jcd_2d(f_inv_x: pt.TensorVariable, x: pt.TensorVariable, n: int) -> pt.TensorVariable:
    """Calc log of Jacobian determinant 2d. Add Jacobian adjustment to models
    where observed is a transformation, to handle change in coords / volume.
    """

    # get the 1D Jacobians for the 2 dimensions of f_inv_x, each w.r.t the 2D x
    j0 = tg.jacobian(expression=f_inv_x[:, 0], wrt=x)  # (n, n, x.shape[1]) = (10, 10, 2)
    j1 = tg.jacobian(expression=f_inv_x[:, 1], wrt=x)  # (n, n, x.shape[1]) = (10, 10, 2)

    # all off-diagonals are zero, so we can get the jcd by product of the diagonals
    ij = (np.arange(n, dtype='int').tolist(), np.arange(n, dtype='int').tolist())
    diag = pt.concatenate((j0[:, :, 0][ij], j1[:, :, 1][ij]))

    # sum the logs instead, because more numerically stable, and because we're
    # returning a log anyway, so it's one less operation
    return pt.sum(pt.log(pt.abs(diag)))

^ is this appallingly bad for any reason?

Using this I also get new warnings (similar case What is an inconsistent graph? - #3 by Galen_Seilis) during sampling :confused:

pymc/pytensorf.py:1056: UserWarning: RNG Variable RandomGeneratorSharedVariable(<Generator(PCG64) at 0x154FE2DC0>) has multiple clients. This is likely an inconsistent random graph.

What’s the shape of f_inv_x? For the jacobian to make sense in this context it should be (n,), with x also (n,); then you can just do jac = tg.jacobian(f_inv_x, x), which is (n, n), then _, logdet = pt.linalg.slogdet(jac), and you’re done.
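
Something like this standalone toy sketch (a made-up (n,) transform, not the model above):

import numpy as np
import pytensor
import pytensor.gradient as tg
import pytensor.tensor as pt

x = pt.vector('x')
f_inv_x = pt.tanh(x)                # stand-in (n,) -> (n,) transform

jac = tg.jacobian(f_inv_x, x)       # (n, n)
_, logdet = pt.linalg.slogdet(jac)  # scalar log|det J|

fn = pytensor.function([x], logdet)
print(fn(np.array([0.1, 0.5, 1.0])))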

In my case f_inv_x and x both have shape (n, 2); it’s a 2D joint distribution where each marginal is independently transformed through other CDFs and iCDFs

I’ve just updated the function above BTW and it seems to work?

But I’d very happily use slogdet on the assumption that it’s better tested / handles edge cases better than my code… I’ll give it a try now to compare, thanks!

So the n is the batch? I think in that case you need to broadcast tg.jacobian to compute one jacobian (and ultimately one determinant) for each sample. It should work out of the box with the new blockwise feature?

Edit: Yeah, n observations is what I mean by “batch”.

Looking at the shape outputs, it’s automatically batching but the shapes are wrong. The correct output should be (10, 2, 2), not (10, 10, 2). You’re getting back 10 (10, 2) matrices, which is what results from taking the jacobian of an (n,) vector with respect to the whole (n, 2) input. What you want is 10 $2 \times 2$ matrices of 4 partial derivatives, each evaluated at one observation


n is just the count of observations in the dataset…

Aha! that does sound a lot more reasonable… I’ll see later then (after dinner) what I can do to pass the right shape / orientation - because simply

j0 = tg.jacobian(expression=f_inv_x[:, 0], wrt=x[:,0])

throws a DisconnectedInputError

I think it should just be:

jac = tg.jacobian(f_inv_x, x) # (10, 2, 2)
_, logdet = pt.linalg.slogdet(jac) # (10, )
return logdet

Alas as I’ve seen before

j = tg.jacobian(expression=mdl.y_c, wrt=mdl.y)
j.eval()
>> ValueError: jacobian expects a 1 dimensional variable as `expression`. If not use flatten to make it a vector

If x is also (10, 2) then I’m confused by what’s going on. My idea was that you have two equations $y_1, y_2$ and two variables $x_1, x_2$, so that the jacobian is $J = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix}$

You can then plug in different numerical values for those, but “symbolically” they should all be the same object, just evaluated at different points. This is what I meant by “batch dimension”. As a concrete example, given a system of two equations:

$$\begin{align} y_{i,1} &= a x_{i,1}^2 + b x_{i,2} \\ y_{i,2} &= c x_{i,1} + d x_{i,2}^2 \end{align}$$

The jacobian I have in mind is $J_i = \begin{bmatrix} 2 a x_{i,1} & b \\ c & 2 d x_{i,2} \end{bmatrix}$, so you can then plug in a bunch of different $x_{i,1}, x_{i,2}$ pairs and get a stack of transformations associated with each datapoint (the $i$ “batch” dimension)
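
Written out symbolically for a single pair, with made-up constants just for illustration:

import pytensor
import pytensor.gradient as tg
import pytensor.tensor as pt

a, b, c, d = 1.0, 2.0, 3.0, 4.0      # arbitrary constants for illustration
x = pt.vector('x')                    # one (x_1, x_2) pair, shape (2,)
y = pt.stack([a * x[0] ** 2 + b * x[1],
              c * x[0] + d * x[1] ** 2])

J = tg.jacobian(y, x)                 # [[2a*x_1, b], [c, 2d*x_2]]
print(pytensor.function([x], J)([0.5, 1.5]))  # [[1. 2.] [3. 12.]]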

In pytensor terms, I expect x to be shape (2,), because although we’re going to ultimately broadcast this Jacobian function to many different values, they are all sharing the same computation.

If x is (n, 2), it seems like each $i$ gets its own unique $x$ variable that has no conceptual connection to the others, like $(x_{1,1}, x_{1,2}), (x_{2,1}, x_{2,2}), \dots$, and we can’t “summarize” with the $x_i$ notation. In this case you have a stack of n systems of 2 equations, and you want the result to be a stack of Jacobians, one per system. You’ll probably have to scan over the x-y pairs if that’s the case.
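
Roughly along these lines (a standalone sketch with a stand-in row-wise transform, so the names and the tanh are made up, not your model):

import numpy as np
import pytensor
import pytensor.gradient as tg
import pytensor.tensor as pt

x = pt.matrix('x')                               # (n, 2) stack of pairs

def one_logdet(x_i):
    # x_i is a single (2,) row; f is a stand-in smooth transform of that row
    f_i = pt.tanh(x_i)
    jac = tg.jacobian(f_i, x_i)                  # (2, 2)
    _, logdet = pt.linalg.slogdet(jac)
    return logdet

logdets, _ = pytensor.scan(one_logdet, sequences=[x])   # (n,)
fn = pytensor.function([x], logdets)
print(fn(np.random.default_rng(0).normal(size=(10, 2))))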


Using scan is a good idea - thanks :slight_smile: I’ll have to try a few things tomorrow. There’s potentially precedent here but I need to work through it to understand more…

At the moment it’s not especially obvious to me why gradient.jacobian(expression, wrt, ...) expects expression to be 1D, rather than handling larger dims. Possibly combinatoric expansion?

In the context of that source code, 1D is for standard multivariate transforms of vector-to-vector functions. 0D is for elemwise/scalar transforms.

So don’t be confused by the 0/1/2. It sounds like you want to handle batched vector transforms like @jessegrabowski mentioned.

Can we have details on the exact type of transformation that’s going on?


Interesting… maybe I’ve misunderstood a concept here - and/or am simply trying something unusual by transforming data

The specific model is in this gist here (994_mre_copula_with_jcd Scenario 1 (Old World): Use `pymc 5.6.1, pytensor 2.12.3` · GitHub) (also linked in my first question above).

In this model I believe I need to apply the Jacobian adjustment to the RV y_c, which is the result of transforming the observed marginal data y (n observations, 2 marginals, shape (n, 2)) via a lognormal_cdf and then a normal_icdf. So y_c is also shape (n, 2).

...
# 3. Transform Observed Y to Uniform through marginal CDFs
y_u = pm.Deterministic('y_u', pt.exp(pm.logcdf(m_d, y)), dims=('oid', 'y_u_nm'))

# 4. Transform Uniformed Y to an Empirical Copula via MvN InvCDF
#    to be later evidenced against our latent MvN copula RVs
y_c = pm.Deterministic('y_c', normal_icdf(y_u), dims=('oid', 'y_c_nm'))
...

So this seems like a different situation from transforming unobserved RVs as in, for example, this gist: (991_mre_jacobiandet_issue.ipynb · GitHub), which is the port of Bob Carpenter’s example that I posted before.

I’ll try the stacked system approach and report back

Hmmm…

How is pytensor.gradient.jacobian supposed to be used? Can the variable in the wrt be a slice of a tensor? Indeed, perhaps I’m just slicing wrong?

After building my model (see gist for Notebook 994 in my above post), I’m trying to build out the scan function, with an inner function that - based on your recommendations above - I think (?) should operate on a slice of the (n, 2) matrices for f_inv_x and x.

Here f_inv_x is my mdl.y_c, and x is mdl.y. They both have shape (10, 2) for the 10 observations in the model input data, so I think the appropriate slices would be e.g.

i = 0
mdl.y_c[i]  # shape (2, )
mdl.y[i]  # shape (2, )

If I try this, I get a disconnected input error, so I suspect I’m slicing wrong, but I don’t know how to slice properly in that case

n = mdl.y_c.shape[0]
idx = pt.arange(n, dtype="int32")

i = 0
print(f'mdl.y_c[{i}]:\n', mdl.y_c[i, :].eval())
print(f'mdl.y[{i}]:\n', mdl.y[i, :].eval())

j = tg.jacobian(expression=mdl.y_c[i, :], wrt=mdl.y[i, :])
j.eval()

yields

mdl.y_c[0]:
 [1.63535418 1.37195447]
mdl.y[0]:
 [0.896752   7.10080497]
/Users/jon/miniforge/envs/oreum_lab/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:691: UserWarning: Optimization Warning: The Op erfcinv does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.
  warn(
---------------------------------------------------------------------------
DisconnectedInputError                    Traceback (most recent call last)
/Users/jon/workspace/oreum/oreum_lab/notebooks/994_mre_copula_with_jcd.ipynb Cell 12 line 8
      5 print(f'mdl.y_c[{i}]:\n', mdl.y_c[i, :].eval())
      6 print(f'mdl.y[{i}]:\n', mdl.y[i, :].eval())
----> 8 j = tg.jacobian(expression=mdl.y_c[i, :], wrt=mdl.y[i, :])
      9 j.eval()

Alternatively, if I feed the wrt with the full non-sliced mdl.y, that seems to work, albeit the returned tensor of course contains unnecessary dimensions

n = mdl.y_c.shape[0]
idx = pt.arange(n, dtype="int32")

i = 0
print(f'mdl.y_c[{i}]:\n', mdl.y_c[i, :].eval())
print(mdl.y_c[i, :].eval().shape)
print(f'mdl.y:\n', mdl.y.eval())
print(mdl.y.eval().shape)

j = tg.jacobian(expression=mdl.y_c[i, :], wrt=mdl.y)
jj = j.eval()
print(jj.shape)
print(jj[:, 0, :])

yields:

mdl.y_c[0]:
 [1.63535418 1.37195447]
(2,)
mdl.y:
 [[ 0.896752    7.10080497]
 [ 0.99277233 20.27468766]
 [ 2.50819806  0.7689727 ]
 [ 1.09347164  7.5479452 ]
 [ 1.07582886  5.55281411]
 [ 0.91014133 21.7633982 ]
 [ 1.41469134 11.23582336]
 [ 0.85416985  8.77394179]
 [ 0.88115079  7.74155373]
 [ 0.79885203 16.73635283]]
(10, 2)
(2, 10, 2)
array([[1.00953437, 0.        ],
       [0.        , 0.11660356]])

This final 2x2 matrix actually looks like what I want; it just looks like I have to dig into the (2, 10, 2) tensor to get it

Any thoughts on sidestepping the disconnected input error?

Okay… finally got a minute to get back to this code and I’m stuck (again, hurray) on misunderestimating what scan actually is / does and how RNGs need to be managed

I’ve an updated gist here with full detail 994_mre_copula_with_jcd WIP log_jcd_stacked_2d · GitHub

I’ve written a new function, loosely based on pymc.pytensorf.jacobian_diag (because that also seems to use scan as a cheap way to iterate over observations). My intention / hope is that this would calculate and return the log_jac_det for each (2D) observation in the dataset such that the relevant model potential _ = pm.Potential('pot_jcd_y_c_on_y', log_jcd_stacked_2d(y_c, y), dims='oid') can work.

import pytensor
import pytensor.gradient as tg
import pytensor.tensor as pt

def log_jcd_stacked_2d(f_inv_x: pt.TensorVariable, x: pt.TensorVariable
                      ) -> pt.TensorVariable:
    """Calc log of Jacobian determinant for 2D matrix, stacked version.
    Add this Jacobian adjustment to models where observed is a transformation,
    to handle change in coords / volume. Developed for a 2D copula transform
    usecase and likely not more broadly applicable due to shapes etc
    """
    n = f_inv_x.shape[0]
    idx = pt.arange(n, dtype="int32")

    def log_jcd_elem(i, f, x):
        """Calc element-wise log of Jacobian det for f and x. All off-diagonals
        are zero, so we can get the det by product of the diagonals. Get that
        product via sum of the logs instead: more numerically stable, and we're
        returning a log anyway, so it's one less operation
        """
        jac = tg.jacobian(expression=f[i, :], wrt=x)  # (2, 10, 2)
        log_jcd = pt.sum(pt.log(pt.abs(jac[:, i, :][([0, 1], [0, 1])])))  # scalar
        return log_jcd

    return pytensor.scan(log_jcd_elem, sequences=[idx],
                         non_sequences=[f_inv_x, x], n_steps=n,
                         name="log_jcd")[0]

First issue

When I test with a simple eval, I get errors in the log that seem related to random number generation, but still get a result that seems plausible

with mdl:
    js = log_jcd_stacked_2d(y_c, y).eval()
print('\nLog JCD:')
print(js)

yields

Log JCD:
[-1.44556129 -2.59644833 -0.25119382 -1.70496206 -1.38172531 -2.58040326
 -2.36034762 -1.60849001 -1.51440496 -2.18733269]
rewriting: validate failed on node normal_rv{0, (0, 0), floatX, False}.out.
 Reason: random_make_inplace, Trying to destroy a protected variable: *2-<RandomGeneratorType>

Second issue

When I try to init_nuts

with mdl:
    initial_points, step = pm.init_nuts(init='auto', chains=1)

the process bombs out with an error:

ValueError: No update found for at least one RNG used in Scan Op Scan{log_jcd, while_loop=False, inplace=none}.
You can use `pytensorf.collect_default_updates` inside the Scan function to return updates automatically.

My initial (failed) efforts to solve…

The ValueError suggests adding something to the internal function that I think might be like:

    def log_jcd_elem(i, f, x):
        …
        update = pm.pytensorf.collect_default_updates([log_jcd])
        return log_jcd, update

… but in my testing so far this seems to have no beneficial effect

Alternatively, this note Simple markov chain with aesara scan - #11 by ricardoV94 from @ricardoV94 deals with passing rngs, but I don’t see how it would apply in my case…

Any and all ideas greatly appreciated chaps!