Hitting a weird error to do with RNGs in Scan in a custom function inside a Potential

In the course of finding out why a gradient calc changed in pytensor > 2.12.3 (Something changed in `pytensor > 2.12.3` (and thus `pymc > 5.6.1`) that makes my `pytensor.gradient.grad` call get stuck - any ideas? - #32 by jonsedar), I’ve hit a number of pain points.

This one seems related but different to that issue, hence a new thread.

The symptom is that I when add a custom calculation (get_log_jcd below) to the model, this throws an error deep down in pytensorf.py, specifically here https://github.com/pymc-devs/pymc/blob/5f29b255127088abc552079fd03c40eb19d83bdd/pymc/pytensorf.py#L1080

 elif isinstance(client.op, Scan):
            # Check if any shared output corresponds to the RNG
            rng_idx = client.inputs.index(rng)
            io_map = client.op.get_oinp_iinp_iout_oout_mappings()["outer_out_from_outer_inp"]
            out_idx = io_map.get(rng_idx, -1)
            if out_idx != -1:
                next_rng = client.outputs[out_idx]
            else:  # No break
                raise ValueError(
                    f"No update found for at least one RNG used in Scan Op {client.op}.\n"
                    "You can use `pytensorf.collect_default_updates` inside the Scan function to return updates automatically."
                )

The error message:

  File "/Users/jon/miniforge/envs/oreum_lab2/lib/python3.10/site-packages/pymc/pytensorf.py", line 1088, in find_default_update
    raise ValueError(
ValueError: No update found for at least one RNG used in Scan Op Scan{scan_fn, while_loop=False, inplace=none}.
You can use `pytensorf.collect_default_updates` inside the Scan function to return updates automatically.

Right now I assume that my custom function get_log_jcd, or something further inside of it (the tg.jacobian?) is present as a Scan object, but the RNG is messed up somehow.

If any kind soul would like to replicate, the following code can be debugged as a script inside VSCode, and if you breakpoint at ..../lib/python3.10/site-packages/pymc/pytensorf.py:1080 you’ll hopefully see what I’m seeing.

I’m working my way through the debug stack, but dont know enough yet to spot what might be wrong. So far I see a couple of RNGs floating around where there might need to be just one? No idea.

Any and all help appreciated as always! I hope that this might be a symptom of a bug / simple oversight somewhere in the pytensor code…

REMOVED OLD BROKEN CODE :D

Even stranger… Thought I’d try to create a simpler model without the custom function, and compare how that ran to the model with the function.

With the code below I create the simpler model mdl1, create a deepcopy mdl2 and add the custom function.

Then I init_nuts for both… and both run without error (!?) pytensorf.py:1080 elif isinstance(client.op, Scan): doesn’t even get hit…

I’m sure this will make sense one day?

REMOVED EVEN MORE OLD BROKEN CODE!

I’m weary of those deepcopy calls! PyMC models can’t be trivially copied like that

Another user had problems here: Equivalent GP models, different results - #5 by jan5

Fair enough, I’ll ignore the results of the second attempt for now, but the first issue persists regarding the Scan and the RNG… I’m running out of ideas :smiley:

What if you do it “by hand”, i.e.:

grads, _ = pytensor.scan(lambda expr, x: pytensor.grad(expr, x), sequences=[y_c.ravel(), y.ravel()])
jac_diag_terms = grads.reshape(-1, k)
logdets = pt.log(jac_diag_terms).sum(axis=-1)

Here I’m exploiting the the fact that the everything is elemwise, and only computing the diagonal of the (nk,nk) raveled jacobian. If you reshape that to (n,k), each row will be the main diagonal of one of the (n,k,k) things you get from the other procedure I pitched, and you can just compute the determinant “by hand”.

That’s a nice idea, and very similar to something I / we tried earlier (Something changed in `pytensor > 2.12.3` (and thus `pymc > 5.6.1`) that makes my `pytensor.gradient.grad` call get stuck - any ideas? - #20 by jonsedar), but it gives a DisconnectedInput error, sadly

def get_log_jcd_scan(f_inv_x: pt.TensorVariable, x: pt.TensorVariable) -> pt.TensorVariable:
    """Calc log of determinant of Jacobian where f_inv_x and x are both (n, k) 
    dimensional tensors, and `f_inv_x` is a direct, element-wise transformation 
    of `x` without cross-terms (diagonals).
    Add this Jacobian adjustment to models where observed is a transformation, 
    to handle change in coords / volume. Initially dev for a model where k = 2.
    Use explicit scan
    """
    n = f_inv_x.shape[0]
    k = f_inv_x.shape[1]
    grads, _ = pytensor.scan(lambda c, w: tg.grad(cost=c, wrt=w), 
                    sequences=[f_inv_x.ravel(), x.ravel()])
    log_jcd = pt.sum(pt.log(pt.abs(grads.reshape(n, k))), axis=1)
    return log_jcd

Before now I’ve found that any slicing of the cost c in e.g. tg.grad(cost=c, wrt=w) generally leads to DisconnectedInputErrors.

FWIW I think explicitly creating the scan over tg.grad feels slightly more explicit than relying on tg.jacobian, and maybe this gives me the opportunity to declare the updates, but I’m still not sure. Several things going on here.

By any chance have you been able to replicate locally?


The full (updated) script again if you wanted to test it locally

REMOVED EVEN MORE OLD BROKEN CODE! :D

That’s perplexing, because this is essentially all jacobian is doing, see here. It uses an index sequence instead of scanning over the variables themselves, but my (admittedly uninformed) intuition is that these should generate equivalent graphs in the end.

I can’t do local tests over the weekend cuz I’m swamped with stuff, but I can circle back early next week if you don’t figure it out. I would suggest trying to boil this down to the simplest possible example to see if it’s a bug or what. Is it because it’s d(RV, data)? because it’s in a potential? Because it’s d(f(rv), data)? If the function matters, which function? etc etc.

1 Like

Unfortunately, those are completely different, because the grad can’t see the graph outside of scan when defined like that. That’s stated explicitly in the docs (second box):

https://pytensor.readthedocs.io/en/latest/tutorial/gradients.html#computing-the-jacobian

I fixed that in my WIP refactor of Scan because it’s extremely unintuitive.

Anyway there is a known bug when you try to get a gradient of a Scan wrt to something that used to have RVs but no explicit updates. I never managed to narrow down that but here is where it was last discussed:

Anyway I’ll try to actual check it locally

1 Like

Can we get something a bit more minimal?

Also @jonsedar when using scan it’s always a good practice to set strict=True and pass all sequences/non-sequences explicitly. Apologies if you are already doing that.

Specially the non-sequences should help PyTensor to understand there are no RNGs needed inside the Scan

2 Likes

Thanks guys, I’ll do a little more digging now and reply back later :slight_smile:

I’ll see if I can make this a little more minimal too

Okay… so back at laptop again and I’ve tried to narrow down to a handful of tests based on the suggestions you’ve kindly shared so far.

I’ve created a new function (not shown here for brevity) to manually calc the log_jcd using scan (based on pytensor.gradient.jacobian) with a bit more control over the scan, in particular to test enforcing strict=True, which actually hinders things there, very interesting….

Run this as a script, and you’ll get the following printout.

The TLDR is:

Test 1

.ravel() results in DisconnectedInputError, so although that’s a nice idea to avoid 2D issues, that seems a non-starter, this relates to our earlier discussion @jessegrabowski

Test 2

If for now I focus on just one of the columns of f_inv_x, then I can run scan along the vector, so that the grad(cost=f[i]) is a scalar, and I require scan strict=True .

This yields a MissingInputError for the 2 RNGs, which is due to logic in scan.basic.scan:1060 preventing them from being listed in other_shared_scan_args and thus passed into Scan.

I believe the expected thing tp dp here would be to add the RV into scan(..non_sequences=[here]), but in my case, the two RNGs come from a 2D pm.LogNormal.dist() and I’ve nothing to pass into scan() - quite interesting

Test 3:

Repeat Test 2 and allow scan strict=False

This time scan runs, because other_shared_scan_args contains the RNGs

However when I sample this in another script I get a UserWarning: RNG Variable RandomGeneratorSharedVariable(<Generator(PCG64) at 0x14942FBC0>) has multiple clients. This is likely an inconsistent random graph.

I guess I’ll try to dig into this UserWarning tomorrow…

# 994x.py
# just the essentials to allow debug

# Author: jonathan.sedar@oreum.io

# Last updated: 2023-10-21 16:06:18

# Python implementation: CPython
# Python version       : 3.10.12
# IPython version      : 8.16.1

# pymc    : 5.9.0
# pytensor: 2.17.3

# Compiler    : Clang 15.0.7 
# OS          : Darwin
# Release     : 22.6.0
# Machine     : x86_64
# Processor   : i386
# CPU cores   : 8
# Architecture: 64bit

import numpy as np
import pandas as pd
import pymc as pm
import pytensor
import pytensor.gradient as tg
import pytensor.tensor as pt
from pymc.distributions.dist_math import check_icdf_parameters, check_icdf_value
from scipy import stats

rng_np = np.random.default_rng(seed=42)
CLIP = 1e-15  # NOTE 1e-18 too small

def normal_icdf(
    x: pt.TensorVariable, mu: pt.TensorVariable = 0.0, sigma: pt.TensorVariable = 1.0
) -> pt.TensorVariable:
    """Normal icdf, default mean-center 1sd aka Inverse CDF aka PPF aka Probit
    NOTE:
    + Modified from pymc.distributions.continuous.Normal.icdf
    + Slow implementation of erfcinv not in C
    + See also https://stackoverflow.com/questions/60472139/computing-the-inverse-of-the-complementary-error-function-erfcinv-in-c
    + Hack to clip values away from edges [0., 1.]:
        Value = {0, 1} is theoretically allowed, but causes a numeric computational
        issue somewhere in pt.erfcinv which throws infs, so clip slightly away.
    + Used in oreum_lab.src.model.copula.model_a
        NOTE: Possibly after pymc > 5.5 will change to use
        y_cop_u_rv = pm.Normal.dist(mu=0., sigma=1.)
        pm.icdf(y_cop_u_rv, pt.stack([y_m1u, y_m2u], axis=1))
    """
    x = pt.clip(x, CLIP, 1 - CLIP)
    r = check_icdf_value(mu - sigma * pt.sqrt(2.0) * pt.erfcinv(2.0 * x), x)
    return check_icdf_parameters(r, sigma > 0.0, msg="sigma > 0")



## Create synthetic data =======================================================

def create_data():
    cfg = dict(r = -0.8)
    cfg.update( 
        cov = [[1.0, cfg['r']], [cfg['r'], 1.0]], 
        mu = [0.2, 2.0], 
        s = [0.5, 1.], 
        n = 10)

    c_dist = stats.multivariate_normal(mean=np.zeros(2), cov=cfg['cov'], seed=rng_np)
    m1_dist = stats.lognorm(scale=np.exp(cfg['mu'][0]), s=cfg['s'][0])
    m2_dist = stats.lognorm(scale=np.exp(cfg['mu'][1]), s=cfg['s'][1])
    u = stats.norm.cdf(c_dist.rvs(cfg['n']))
    dfy = pd.DataFrame(np.vstack([m1_dist.ppf(u[:, 0]), m2_dist.ppf(u[:, 1])]).T, 
                    columns=['m1', 'm2'])
    return cfg, dfy


## Build model =================================================================

def create_model(cfg, dfy):
    coords = dict(
        m_nm = ['m_m1', 'm_m2'],
        s_nm = ['s_m1', 's_m2'],
        y_nm = ['y_m1', 'y_m2'],
        y_u_nm = ['y_m1u', 'y_m2u'],
        y_c_nm = ['y_m1c', 'y_m2c'],
    )
    coords_m = dict(oid = dfy.index.values)

    with pm.Model(coords=coords, coords_mutable=coords_m) as mdl:
        y = pm.MutableData('y', dfy.copy(), dims=('oid', 'y_nm'))

        # 1. Define Observed Y marginal dists
        m_mu = pm.Normal('m_mu', mu=0.0, sigma=1.0, dims='m_nm')
        m_s = pm.InverseGamma('m_s', alpha=5.0, beta=4.0, dims='s_nm')
        m_d = pm.LogNormal.dist(mu=m_mu, sigma=m_s, shape=(cfg['n'], 2))

        # 3. Transform Observed Y to Uniform through marginal CDFs
        y_u = pm.Deterministic('y_u', pt.exp(pm.logcdf(m_d, y)), dims=('oid', 'y_u_nm'))

        # 4. Transform Uniformed Y to an Empirical Copula via MvN InvCDF
        y_c = pm.Deterministic('y_c', normal_icdf(y_u), dims=('oid', 'y_c_nm'))

    return mdl


## Define tests for scan  ======================================================

def test1_scan(f_inv_x, x):
    """1. Use ravel: get a `DisconnectedInputError`"""
    n = f_inv_x.shape[0]
    k = f_inv_x.shape[1]
    idx = pt.arange(n, dtype="int32")

    def get_grads(i, c, w):
        return tg.grad(cost=c[i], wrt=[w])

    grads, _ = pytensor.scan(get_grads, 
                    sequences=idx, non_sequences=[f_inv_x.ravel(), x.ravel()],
                    n_steps=n*k, name="get_grads", strict=True)
    return grads


def test2_scan(f_inv_x, x):
    """2. Focus on a single scalar cost `c[i, 0]` and `strict=True`: 
    get a `MissingInputError` for the RNG
    """
    n = f_inv_x.shape[0]
    idx = pt.arange(n, dtype="int32")

    def get_grads(i, c, w):
        return tg.grad(cost=c[i, 0], wrt=[w])

    grads, _ = pytensor.scan(get_grads, 
                sequences=idx, non_sequences=[f_inv_x, x],
                n_steps=n, name="get_grads", strict=True)
    
    return grads


def test3_scan(f_inv_x, x):
    """3. Focus on a single scalar cost `c[i, 0]` and relax `strict=False`: 
    this runs
    """
    n = f_inv_x.shape[0]
    idx = pt.arange(n, dtype="int32")

    def get_grads(i, c, w):
        return tg.grad(cost=c[i, 0], wrt=[w])

    grads, _ = pytensor.scan(get_grads, 
                sequences=idx, non_sequences=[f_inv_x, x],
                n_steps=n, name="get_grads", strict=False)
    
    return grads


## Run tests ===================================================================

if __name__ == '__main__':
    cfg, dfy = create_data()
    mdl = create_model(cfg, dfy)
    
    print("\n\nTest 1:")
    try:
        # 1. Use ravel: get a `DisconnectedInputError`
        t1 = test1_scan(mdl.y_c, mdl.y).eval()
    except tg.DisconnectedInputError as e:
        print(e)
    
    print("\n\nTest 2:")
    try:
        # Focus on a single scalar cost `c[i, 0]` and `strict=True`: 
        # get a `MissingInputError` for the 2 RNGs from the LogNormal.dist()
        t2 = test2_scan(mdl.y_c, mdl.y).eval()
    except pytensor.graph.utils.MissingInputError as e:
        print(e)
    
    print("\n\nTest 3:")
    # 3. Focus on a single scalar cost `c[i, 0]` and relax `strict=False`: 
    # runs and gives values
    t3 = test3_scan(mdl.y_c, mdl.y).eval()
    
    print(t3.shape)
    print(t3.sum(axis=0))
       

est 1:
 
Backtrace when that variable is created:

  File "/Users/jon/workspace/oreum/oreum_lab2/notebooks/994x.py", line 165, in <module>
    t1 = test1_scan(mdl.y_c, mdl.y).eval()
  File "/Users/jon/workspace/oreum/oreum_lab2/notebooks/994x.py", line 117, in test1_scan
    sequences=idx, non_sequences=[f_inv_x.ravel(), x.ravel()],



Test 2:
Scan is missing inputs: [RandomGeneratorSharedVariable(<Generator(PCG64) at 0x1465D92A0>), RandomGeneratorSharedVariable(<Generator(PCG64) at 0x14A5211C0>)]


Test 3:
/Users/jon/miniforge/envs/oreum_lab2/lib/python3.10/site-packages/pytensor/tensor/rewriting/elemwise.py:700: UserWarning: Optimization Warning: The Op erfcinv does not provide a C implementation. As well as being potentially slow, this also disables loop fusion.
  warn(
(10, 10, 2)
[[1.17571404 0.        ]
 [1.06199971 0.        ]
 [0.42035114 0.        ]
 [0.96419869 0.        ]
 [0.98001082 0.        ]
 [1.15841781 0.        ]
 [0.74526782 0.        ]
 [1.23432585 0.        ]
 [1.19653064 0.        ]
 [1.31979877 0.        ]]

I dug into the UserWarning: RNG Variable RandomGeneratorSharedVariable(<Generator(PCG64) at 0x14942FBC0>) has multiple clients. This is likely an inconsistent random graph.

Indeed here in collect_default_updates each one of the 2 RNGs was linked to the same y_c RV in my model - I think as a result of passing the y_c into my get_log_jcd_scan()

Now that I’ve figured out the scan issues, I tried mdl.replace_rvs_by_values, and lo and behold - it seems to work!

y_c_vals = mdl.replace_rvs_by_values([y_c])[0]
_ = pm.Potential('pot_jcd_y_c_on_y', get_log_jcd_scan(y_c_vals, y), dims='oid')

I think I’ll treat this thread as closed, since I changed several things and finally somehow arrived at a model that includes the log_jcd and samples without complaint! I’m going to carefully extend it through the full workflow and watch for issues…

Thanks guys for all your help! I may try to add an example notebook one day with my learnings - once I more fully understand things :slight_smile:

1 Like

I went and checked your different tests

test1_scan

Your first attempt with ravel failure has nothing to do with scan in test1_scan. Trying any at.grad(f.ravel().sum(), x.ravel()) will fail because x.ravel() is not an input to f.ravel(), even if x is an input to f.

test2_scan

Your second attempt test2_scan fails because you didn’t provide all the non_sequences. In particular you didn’t say the grad will depend on m_mu and m_s, so PyTensor Scan thinks those RVs are actual part of the graph and complains that you didn’t pass their RNGs. This should work:

    grads, _ = pytensor.scan(
        get_grads, 
        sequences=idx, 
        non_sequences=[f_inv_x, x, m_mu, m_s],
        n_steps=n, 
        name="get_grads", 
        strict=True,
    )

You know these variables are involved because they are the only ones used in the logcdf. This is also a limitation of the old PyTensor Scan. In my WIP refactor we are more conservative about what constant inputs are actually needed when the user forgot to specify them manually. The existing Scan goes all the way to the roots instead.

strict=True is very useful because it will at least raise when you have unexpected RVs in the graph!

test3_scan

test_scan3 is actually problematic. It retains RVs in the inner graph of the Scan (meaning the logp is wrong and will change every time you define it). You don’t want to provide updates (which would avoid the error later raised by ValueGradFunction ), you want to actual NOT have any RVs there. That’s why the non_sequences is very important.

So case3 is what I called a bug, where we fail to properly replace the RVs in the Scan inner graph. This is because the replace_rvs_by_values function doesn’t look inside Scans at the moment. That is something we could try to fix, but in general it’s very hard to manipulate pre-existing Scans. Anyway I’ll open an issue to track that.

Note that by calling replace_rvs_by_values before you define the Scan you avoid this issue, because you manually replaced the right RVs by their values before any Scan was even defined. This however fails in model_to_graphviz which is unfortunate.

One way to test you don’t have any unwanted RVs, just to be on the safe side, is to use pymc.testing.assert_no_rvs on model.logp(). This will raise if any RVs are found in the logp graph, even if they are inside a Scan! In test_scan3, after I add grads as a Potential I get the following:

from pymc.testing import assert_no_rvs
assert_no_rvs(mdl.logp())
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [80], in <cell line: 2>()
      1 from pymc.testing import assert_no_rvs
----> 2 assert_no_rvs(mdl.logp())

File ~/Documents/Projects/pymc/pymc/testing.py:957, in assert_no_rvs(vars)
    955 rvs = find_rvs_in_graph(vars)
    956 if rvs:
--> 957     raise AssertionError(f"RV found in graph: {rvs}")

AssertionError: RV found in graph: {m_s, m_mu}

BTW this entire thread was a great summary of the design issues with Scan, and a reason why we want to refactor it. The “WIP refactor” I keep talking about is here in case anybody is curious: Implement new Loop and Scan operators by ricardoV94 · Pull Request #191 · pymc-devs/pytensor · GitHub

2 Likes

Fan-fxxxxxing-tastic! Thank you @ricardoV94 - this is some deep expertise!

You’re dead right, I tried a new version of test2, passing in m_mu and m_s and passing them into grad as consider_constant - is that right btw? Seems to work

def test4_scan(f_inv_x, x, consts):
    """4. Focus on a single scalar cost `c[i, 0]` and `strict=True`: 
    Pass precursors RVs
    """
    n = f_inv_x.shape[0]
    idx = pt.arange(n, dtype="int32")
    
    def get_grads(i, c, w, *args):
        return tg.grad(cost=c[i, 0], wrt=[w], consider_constant=[*args])

    grads, _ = pytensor.scan(get_grads, 
                sequences=idx, non_sequences=[f_inv_x, x, *consts],
                n_steps=n, name="get_grads", strict=True)
    
    return grads

t4 = test4_scan(mdl.y_c, mdl.y, consts=[mdl.m_mu, mdl.m_s]).eval()

And now I dont need y_c_vals = mdl.replace_rvs_by_values([mdl.y_c])[0] … happy days!

You don’t even need consider_constants, the old grad call should work fine. Just ignore *args in the inner function.

Dead right - nice one :slight_smile:

Might it still help to declare the consider_constant though? Perhaps run faster?

I would err on the side of not declaring it. I don’t know exactly how it behaves and if they should be considered constants or not :slight_smile:

Heh fair enough! That does seem safer :slight_smile: