Error with custom distribution after using scan

Hi all,

While building a time series model I’m encountering the following error: ValueError: traffic_innov is not in list. When I look at the list in question, I do find a variable named traffic_innov, so my best hypothesis at this point is that a copy is being made somewhere and then equality comparison fails despite the objects being identical from the modelling perspective.

Any help would be very much appreciated :pray:

Quick context

It’s a model of arrival times which depend on traffic. Traffic has a persistent behaviour (modelled with an AR) and is also affected by weather. Basically there are three parameters, a scan loop and a likelihood (times have been floored, hence the use of a shifted geometric instead of exponential).

I’ve tried to reduce the model’s complexity as much as possible, so the code below is not entirely representative of what I’m trying to do but it serves to reproduce the error I’m stuck with.

MRE

import numpy as np
import pandas as pd
import pymc as pm
from pymc.pytensorf import collect_default_updates
import pytensor


# Create fake data
lags = 1
n_stops = 5
steps = n_stops - lags
n_trips = 2
stop_sequence = np.arange(n_stops, dtype=int)
trips = np.arange(n_trips, dtype=int)
delay_delta = np.array(
    [
        [1, 1, 1, 2, 4],
        [2, 2, 1, 1, 1],
    ]
)
weather_states = ["sunny", "rainy", "storm"]
weather = np.array(
    [
        [0, 0, 0, 1, 2],
        [1, 1, 1, 0, 0],
    ]
)

delay_df = pd.DataFrame(delay_delta, columns=stop_sequence)
weather_df = pd.DataFrame(weather, columns=stop_sequence)
data = pd.concat([delay_df, weather_df], axis=1, keys=["delay", "weather"])

coords = {
    "weather_states": weather_states,
    "lags": range(-lags, 0),
    "stops": stop_sequence,
    "steps": stop_sequence[:-lags],
    "instances": trips,
}


def ar_step(weather, traffic_t1, rho, mu, sigma):
    innov = pm.LogNormal.dist(mu=mu[weather[0]], sigma=sigma[weather[0]])
    traffic = traffic_t1 * rho[0] + innov * (1 - rho[0])
    return traffic, collect_default_updates([traffic])


def ar_dist(ar_init, weather, rho, mu, sigma, n_steps, size):
    ar_innov, _ = pytensor.scan(
        fn=ar_step,
        sequences=[
            {"input": weather, "taps": 0},
        ],
        outputs_info=[{"initial": ar_init, "taps": [-lags]}],
        # outputs_info=[{"initial": ar_init, "taps": range(-lags, 0)}],
        non_sequences=[rho, mu, sigma],
        n_steps=n_steps,
        strict=True,
    )
    return ar_innov


def geom_shift(scale, size):
    prob = 1 - pm.math.exp(-1 / scale)
    geom = pm.Geometric.dist(p=prob, size=size)
    return geom - 1


with pm.Model(coords=coords) as ar_model:
    # Data
    delay_follow = pm.Data(
        "delay_follow_obs",
        data["delay"].values[:, 1:],
        dims=("instances", "steps"),
    )
    weather_first = pm.Data(
        "weather_first_obs", data["weather"].values[:, 1], dims=("instances",)
    )
    weather_follow = pm.Data(
        f"weather_follow_obs",
        data["weather"].values[:, 1:],
        dims=(
            "instances",
            "steps",
        ),
    )
    # Parameters
    rho = pm.Beta("rho", 5, 1, dims=("lags",))  # traffic persistence
    traffic_weather_mu = pm.Normal(
        "traffic_weather_mu", 0, 1, dims=("weather_states",)
    )
    traffic_weather_sigma = pm.HalfNormal(
        "traffic_weather_sigma", 1, dims=("weather_states",)
    )
    # AR Model
    traffic_init = [
        pm.LogNormal(
            f"init{idx}",
            traffic_weather_mu[weather],
            traffic_weather_sigma[weather],
        )
        for idx, weather in enumerate(weather_first)
    ]
    traffic_innov = pm.CustomDist(
        f"traffic_innov",
        traffic_init,
        weather_follow.reshape((steps, -1)),
        rho,
        traffic_weather_mu,
        traffic_weather_sigma,
        steps,
        dist=ar_dist,
        dims=(
            f"steps",
            f"instances",
        ),  # scan iterates over leftmost dimension
    )
    # Likelihood
    pm.CustomDist(
        "delay",
        traffic_innov.T,
        dist=geom_shift,
        observed=delay_follow,
        dims=("instances", "steps"),
    )

ar_model.point_logps()

with ar_model:
    idata = pm.sample()

Output:

...

File ~\miniconda3\envs\bus\lib\site-packages\pymc\logprob\basic.py:516, in <listcomp>(.0)
    513 node_rvs = [valued_var.inputs[0] for valued_var in valued_nodes]
    514 node_values = [valued_var.inputs[1] for valued_var in valued_nodes]
    515 node_output_idxs = [
--> 516     fgraph.outputs.index(valued_var.outputs[0]) for valued_var in valued_nodes
    517 ]
    519 # Replace `RandomVariable`s in the inputs with value variables.
    520 # Also, store the results in the `replacements` map for the nodes that follow.
    521 for node_rv, node_value in zip(node_rvs, node_values):

ValueError: traffic_innov is not in list

Additional comments

But when I check fgraph.outputs it does seem to have traffic_innov:

(Pdb++) fgraph.outputs
[ValuedRV.0, traffic_weather_mu, ValuedRV.0, ValuedRV.0, ValuedRV.0, traffic_innov, delay]

If I replace the shifted geometric (custom distribution) by the standard geometric it works well (sampling is full of divergences but it’s not like I’m feeding it sensible data), so that seems to be playing a key role.

    pm.Geometric(
        "delay",
        p=1 - pm.math.exp(-traffic_innov.T),
        observed=delay_follow,
        dims=("instances", "steps"),
    )

Still I have used custom distributions without trouble and it seems to me there’s something in the interaction with scan that’s leading to the error. I’ve tried a similar model in which traffic is not affected by weather (and hence there’s only one traffic_init variable) and there everything works fine.

Versions

  • pymc: 5.17 and 5.19.1
  • pytensor: 2.25.5 and 2.26.4