Learning rate scheduling for Variational Inference

Hi there!

I’ve started using PyMC’s variational inference to fit my models. I have no prior experience with this, but I guess there are parallels with training neural networks, where the choice of optimiser and learning rate makes a huge impact on training quality and speed. A common technique when training neural networks is to use a learning rate scheduler, which reduces the learning rate on a schedule: you start high for fast initial convergence and decrease the rate in later epochs where you want to be more precise.

I’m thinking about implementing this in PyMC, either nicely or in a hacky way to begin with, via a callback. Does anyone have any ideas or comments on this?

Cheers!

I was just thinking about this yesterday, and I’d be interested in helping you add this feature.

I was thinking of copying the PyTorch API, which looks like this:

import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(20):
    for input, target in dataset:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler.step()

You wrap the optimizer in a learning rate scheduler, and internally the scheduler updates the learning rate whenever you call scheduler.step().

I haven’t looked closely at the problem, but I think we could do something similar for the pytensor optimizers. The advantage of doing things this way is that it allows for flexible parameterization of the learning rate schedules, and (potentially) even chaining schedulers together.

Anyway if you put something together (whether it’s this way or via callback), open a PR on the PyMC repo and I’d be excited to look at it.

Hi Jesse, thanks for your reply.

I’m not familiar with PyMC’s internals, but I guess callbacks are not a good option as they probably aren’t supposed to modify the optimiser. Anyway, will have a look and report back.

Hi,

I’ve had a look at PyMC’s code and I’m wondering whether there’s a way of updating the optimiser’s learning rate on the fly via a callback. The call sequence seems to be:

  1. pymc.variational.inference.fit()
  2. pymc.variational.inference.Inference.fit()
  3. pymc.variational.inference._iterate_with_loss()

and the optimiser (passed as obj_optimizer) is compiled into a step_function (no idea what goes on here).

I guess my question is, is the optimiser accessible and modifiable from pymc.variational.inference._iterate_with_loss(), which is where the training loop lies?

Thanks!

So what you want is to have the learning_rate parameter be a pytensor.shared variable. This is the class of variables that are allowed to dynamically change after a function is compiled. The wrinkle is that you need to provide pytensor with a mapping old_value:new_value that tells it how to update these shared variables. This is called updates, and it’s created by the optimizer itself.
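
To make that concrete, here’s a tiny standalone example of a shared variable being changed through an updates mapping (the variable names are just for illustration):

import pytensor

# A shared variable keeps its value between calls of a compiled function
lr = pytensor.shared(0.1, "lr")

# The updates mapping {shared_variable: new_symbolic_value} tells pytensor how to
# change the shared variable every time the compiled function is called
f = pytensor.function([], lr, updates={lr: lr * 0.5})

print(f())             # 0.1  (the output is computed before the update is applied)
print(f())             # 0.05
print(lr.get_value())  # 0.025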

This is why I think the best way to go is to write a wrapper that intercepts the optimizer, replaces the learning rate with a shared variable, and then injects a new update describing how to update the learning rate as a function of time. Here’s a really rough draft of a multiplicative step scheduler:

import numpy as np
import pymc as pm
import pytensor
import pytensor.tensor as pt
from functools import wraps

def step_lr_scheduler(optimizer, update_every, gamma=0.1):
    # optimizer is a functools.partial, so we can update the keyword arguments in place
    # by mutating the .keywords dictionary
    kwargs = optimizer.keywords
    
    # Replace the provided learning_rate with a shared variable
    shared_lr = pytensor.shared(kwargs['learning_rate'], 'learning_rate')
    
    # Set partial function keyword argument to the new shared variable
    kwargs['learning_rate'] = shared_lr
    
    @wraps(optimizer)
    def scheduled_optimizer(*args, **kwargs):
        # Get the updates dictionary from optimizer
        updates = optimizer(*args, **kwargs)
        
        # The last update for all the optimizers is the timestep (is this always true?)
        # We need to use the time shared variable to do our lr update (so everything is in sync)
        t = updates[list(updates.keys())[-1]]
        
        # Here's the actual learning rate update function
        new_lr = pt.switch(pt.eq(pt.mod(t, update_every), 0),
                                 shared_lr * gamma,
                                 shared_lr)
        
        # Add the learning rate update to the updates dictionary
        updates[shared_lr] = new_lr
        return updates
    
    # Return the wrapped optimizer partial function
    return scheduled_optimizer

Here’s how it would work:

# Generate some fake regression data
true_params = np.random.normal(size=(4,))
X_data = np.random.normal(size=(100, 4))
targets = X_data @ true_params

# Do linear regression via SGD
X_data_pt = pt.as_tensor_variable(X_data)
targets_pt = pt.as_tensor_variable(targets)

# Make parameters to be optimized by the scheduled adam optimizer
params = [pytensor.shared(np.zeros(4,), name='params')]

# Compute the total squared error as the loss function
loss = ((X_data_pt @ params[0] - targets_pt) ** 2).sum()

# Make the scheduled optimizer
optimizer = step_lr_scheduler(pm.adam(learning_rate=1.0), 10, gamma=0.99)
optimizer

>>> Out: <function __main__.step_lr_scheduler.<locals>.scheduled_optimizer(loss_or_grads=None, params=None, *, learning_rate=learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-08)>

So, as before, we get back something that behaves like the original partial function. As you can see, the learning_rate kwarg is now “learning_rate”, which is the name of our shared variable. We can compile a dummy function and run it a bunch of times to make sure the learning rate is updating as expected:

# Create the update dict for everything in the model
updates = optimizer(loss, params)

# Specifically grab the learning rate variable, because we want to track it and make sure
# it updates as expected
new_lr = list(updates.values())[-1]

# Compile the SGD function. We don't need any inputs (everything is shared), but we want to see
# the loss, learning rate, and parameter values at every step
# Passing updates is a must, because that's how the shared variables get updated
f = pytensor.function([], [loss, new_lr] + params, updates=updates)

# Execute the function a bunch of times (simulates a training batch)
for _ in range(100):
    print(f())

This gives us:

[array(864.79879456), array(1.), array([0., 0., 0., 0.])]
[array(92.89969091), array(1.), array([-1., -1.,  1., -1.])]
[array(110.41362865), array(1.), array([-1.89211474, -1.52339263,  1.85110198, -1.83649572])]
[array(311.38256283), array(1.), array([-2.48227539, -1.4262917 ,  2.29689346, -2.24316702])]
[array(321.97346962), array(1.), array([-2.68081503, -1.05816636,  2.31979847, -2.23195632])]
[array(199.61391494), array(1.), array([-2.58909828, -0.61716199,  2.08470927, -1.9761059 ])]
[array(87.54677018), array(1.), array([-2.32732722, -0.26504298,  1.72133361, -1.60492838])]
[array(47.85639667), array(1.), array([-1.9874953 , -0.12268673,  1.32737604, -1.21745443])]
[array(71.28263936), array(1.), array([-1.64594209, -0.19705334,  0.9908905 , -0.90213686])]
[array(112.27732672), array(0.99), array([-1.36816543, -0.42513222,  0.78195331, -0.72348732])]

So it seems to be working. You should be able to use the scheduled_optimizer as a drop-in replacement for pm.adam. You can see the learning rate updates to 0.99 on the 10th step. If you run this yourself you’ll also see the parameters converge to true_params, which is what we want.
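
For example, dropping it into a model would look roughly like this (an untested sketch; the model here is just a stand-in):

import numpy as np
import pymc as pm

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    pm.Normal("obs", mu, 1, observed=np.random.normal(size=100))

    # The scheduled optimizer goes wherever pm.adam would normally go
    optimizer = step_lr_scheduler(pm.adam(learning_rate=0.1), update_every=1000, gamma=0.9)
    approx = pm.fit(n=10_000, obj_optimizer=optimizer)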

Obviously this has a lot of room for improvement, but these are the broad strokes I’m thinking about.

I’ve been playing with my actual use-case and your suggested code seems to work well :slightly_smiling_face:

I’ve always found TensorFlow’s ReduceLROnPlateau extremely useful as you don’t have to worry about balancing the number of iterations and the gamma decay rate. I reckon this could be implemented easily as well, monitoring the change in the inferred parameters, and perhaps combined with the early stopping given by pymc.variational.callbacks.CheckParametersConvergence.

On a side note, I found monitoring the learning rate with a callback very useful. Perhaps this could be included in your step_lr_scheduler code to make it easier to access.

from pymc.variational.callbacks import Tracker

optimiser = step_lr_scheduler(pm.adam(learning_rate=1e-0), 1, gamma=0.999)

# Record the current learning rate at every iteration; pass callbacks=[tracker] to pm.fit
tracker = Tracker(
    lr=lambda: optimiser.__wrapped__.keywords["learning_rate"].get_value()
)
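
Both callbacks could then be passed to pm.fit together, something like this (a sketch, assuming a model context and the optimiser above):

from pymc.variational.callbacks import CheckParametersConvergence

with model:
    approx = pm.fit(
        n=50_000,
        obj_optimizer=optimiser,
        callbacks=[
            tracker,
            CheckParametersConvergence(every=100, tolerance=1e-3),
        ],
    )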

Well, I guess a ReduceLROnPlateau-like scheduler would need access to the current and previous parameters to compute the difference, and it’s not obvious to me where to grab those from, but it should be doable :laughing:

Perhaps this could be achieved without much coding by doing the following:

  1. Set up a model & optimiser with the desired initial learning rate
  2. Train on data with a CheckParametersConvergence callback.
  3. When it finishes, create another optimiser with a lower learning rate, and train again starting from the most recent inferred parameters (not sure how to do this though).
  4. Iterate.

You have the parameter values and the current loss inside the scheduled_optimizer function. I wrote *args, **kwargs because I’m lazy, but if you actually check, the first two arguments to adam are loss_or_grads, params. So you can rewrite the inner function to be clearer:

    @wraps(optimizer)
    def scheduled_optimizer(loss_or_grads, params, *args, **kwargs):
        # Get the updates dictionary from optimizer
        updates = optimizer(loss_or_grads, params, *args, **kwargs)

        # ....

So you have the current param values explicitly, and you could do something with that. You need to make an update mapping from the old params (what comes into the wrapper function) to the new params (what is output by the optimizer). Once you have those, you can compute whatever convergence criterion you like, then use that criterion inside the pt.switch.

You might be right that just having the learning_rate as a shared variable is enough to do your refitting strategy, though. Maybe you could dispense with all the optimizer wrappers and just pass a shared variable directly, like:

learning_rate = pytensor.shared(0.01, 'learning_rate')
optim = pm.adam(learning_rate=learning_rate)
advi = pm.ADVI(obj_optimizer=optim)
approx = advi.fit(...)

Then train until convergence. After it converges, you can do learning_rate.set_value(3e-4) (or whatever), then call advi.refine() to train some more. The learning rate will be automatically reduced, because of how shared variables work.
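
Concretely, that refitting loop might look something like this (a rough sketch, not run end to end; assumes a model context):

from pymc.variational.callbacks import CheckParametersConvergence

learning_rate = pytensor.shared(0.01, "learning_rate")

with model:
    advi = pm.ADVI(obj_optimizer=pm.adam(learning_rate=learning_rate))
    approx = advi.fit(20_000, callbacks=[CheckParametersConvergence()])

# Drop the learning rate and keep training the same approximation
learning_rate.set_value(3e-4)
advi.refine(10_000)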

EDIT: I found a bug, pm.ADVI ignores the obj_optimizer argument, so this doesn’t work. In principle it should, though.

If you wanted to write a wrapper, it would look like this:

def flatten_shared(shared_list):
    return pt.concatenate([sh.flatten() for sh in shared_list])

def flatten_shared_np(shared_list):
    return np.concatenate([sh.get_value().flatten() for sh in shared_list])

def convergence_lr_scheduler(optimizer, tol=1e-4, gamma=0.5):
    # optimizer is a functools.partial, so we can update the keyword arguments in place
    # by mutating the .keywords dictionary
    kwargs = optimizer.keywords
    
    # Replace the provided learning_rate with a shared variable
    shared_lr = pytensor.shared(kwargs['learning_rate'], 'learning_rate')
    
    # Set partial function keyword argument to the new shared variable
    kwargs['learning_rate'] = shared_lr

    @wraps(optimizer)
    def scheduled_optimizer(loss_or_grads, params, *args, **kwargs):
        # There's a conceptual thing to wrap your head around here: this code is all called only 
        # ONCE, then the update mapping it generates is used over and over. So we can make a
        # new shared variable and it will be correctly updated at every iteration.
        
        # To convince yourself of this, add a print statement inside here, then run pm.fit
        
        # The consequence is we can set up a shared variable that holds past parameter values,
        # then update prev -> new at every iteration. This lets us compute whatever convergence
        # criteria we want.
        
        # (We could also do the same thing with the loss function, if desired)
        # We want a flat vector of parameters for the convergence checks; initialize it
        # from the current parameter values.
        prev_params = pytensor.shared(flatten_shared_np(params), 'prev_params')
        
        # Compute updates as normal, with the new values 
        updates = optimizer(loss_or_grads, params, *args, **kwargs)
        
        # Grab the current timestep to prevent a LR update on step 0
        values = list(updates.values())
        t = values[-1]
        
        # Make a (symbolic!) flat vector of parameters
        flat_params = flatten_shared(new_params)
        
        # Compute convergence criteria
        sq_change = ((flat_params - prev_params) ** 2).sum()
        sq_change = pytensor.printing.Print('sq_change')(sq_change)
                        
        # Check for a plateau. If we find one, reduce the learning rate
        new_lr = pt.switch(pt.and_(pt.lt(sq_change, tol), pt.gt(t, 0)),
                                 shared_lr * gamma,
                                 shared_lr)
        
        # Add the learning rate update to the updates dictionary
        updates[shared_lr] = new_lr
        
        # Update the prev_params to the new_params
        updates[prev_params] = flat_params

        return updates
    
    # Return the wrapped optimizer partial function
    return scheduled_optimizer

I tried to comment it for clarity, but let me know if it’s not clear what’s going on.


Many thanks for the quick reply, will have a proper look on Monday.

I had actually noticed that the scheduler code was called only once but the code comments are very useful for learning what’s going on behind the scenes!

Do you reckon it’s worth trying to get this new functionality into PyMC? You seem to have solved the problem, and I imagine this could be useful for many people, so it could just go in a PR.

I personally want it, so yes, but can you show me how you would add the tracker to let the LR be used in a callback? I’m not at all familiar with how that part of the API works. You’ll also want to try tracking the old and new parameters to make sure it’s updating as expected.

I think there was actually an error in this regard, which I’ve now edited: the old params should just be initialized as the params themselves, not zeros. Here’s a snippet that shows how I think it works:

import numpy as np
import pytensor

params = pytensor.shared(np.ones(3,), 'params')
old_params = pytensor.shared(params.get_value(), 'old_params')
new_params = params + 1
updates = {params:new_params,
           old_params:new_params}
f = pytensor.function([], [old_params, new_params], updates=updates)
for _ in range(3):
    print(f())

>>> [array([1., 1., 1.]), array([2., 2., 2.])]
>>> [array([2., 2., 2.]), array([3., 3., 3.])]
>>> [array([3., 3., 3.]), array([4., 4., 4.])]

I also added a check that t > 0, so that it doesn’t update the LR on the first step

Morning Jesse,

you’re using an undefined variable new_params in your code above:

# Make a (symbolic!) flat vector of parameters
flat_params = flatten_shared(new_params)

I guess it must be set up to be grabbed from updates?

I’m trying to put this all together and then set up callbacks to monitor convergence, etc.

Ah yeah, good catch. I think you can just use params here (the function input). See the minimal pytensor example I posted for why. Otherwise, yeah, you have to grab it out of the values side of the updates dict.

I guess params won’t work as that’s assigned to prev_params?

prev_params = pytensor.shared(flatten_shared_np(params), 'prev_params')

The new parameters should be in updates, but I’m running into dimensionality issues :thinking:

This is the minimal example I’m using for testing:

from functools import wraps

import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import pytensor
import pytensor.tensor as pt
from matplotlib_inline.backend_inline import set_matplotlib_formats
from pymc.variational.callbacks import Tracker

set_matplotlib_formats("retina")


# %% LR scheduler
def flatten_shared(shared_list):
    return pt.concatenate([sh.flatten() for sh in shared_list])


def flatten_shared_np(shared_list):
    return np.concatenate([sh.get_value().flatten() for sh in shared_list])


def convergence_lr_scheduler(optimizer, tol=1e-4, gamma=0.5):
    # optimizer is a functools.partial, so we can update the keyword arguments in place
    # by mutating the .keywords dictionary
    kwargs = optimizer.keywords

    # Replace the provided learning_rate with a shared variable
    shared_lr = pytensor.shared(kwargs["learning_rate"], "learning_rate")

    # Set partial function keyword argument to the new shared variable
    kwargs["learning_rate"] = shared_lr

    @wraps(optimizer)
    def scheduled_optimizer(loss_or_grads, params, *args, **kwargs):
        # There's a conceptual thing to wrap your head around here: this code is all
        # called only ONCE, then the update mapping it generates is used over and over.
        # So we can make a new shared variable and it will be correctly updated at every
        # iteration.

        # The consequence is we can set up a shared variable that holds past parameter
        # values, then update prev -> new at every iteration. This lets us compute
        # whatever convergence criteria we want.

        # (We could also do the same thing with the loss function, if desired)
        # We want a flat vector of parameters for the convergence checks; initialize it
        # from the current parameter values.
        prev_params = pytensor.shared(flatten_shared_np(params), "prev_params")

        # Compute updates as normal, with the new values
        updates = optimizer(loss_or_grads, params, *args, **kwargs)

        # Grab the current time step to prevent a LR update on step 0
        values = list(updates.values())
        t = values[-1]

        # Make a (symbolic!) flat vector of parameters
        flat_params = pytensor.shared(flatten_shared_np(updates), "flat_params")

        # Compute convergence criteria
        sq_change = ((prev_params - flat_params) ** 2).sum()
        sq_change = pytensor.printing.Print("sq_change")(sq_change)

        # Check for a plateau. If we find one, reduce the learning rate
        new_lr = pt.switch(
            pt.and_(pt.lt(sq_change, tol), pt.gt(t, 0)), shared_lr * gamma, shared_lr
        )
        # new_lr = shared_lr * gamma

        # Add the learning rate update to the updates dictionary
        updates[shared_lr] = new_lr

        # # Update the prev_params to the new_params
        updates[prev_params] = flat_params

        return updates

    # Return the wrapped optimizer partial function
    return scheduled_optimizer


# %% Fit a linear regression model to some synthetic data
length = 720

rng = np.random.default_rng(1337)
x = np.linspace(1e-2, 1, num=length)
true_regression_line = 5 * x + 4
y = true_regression_line + rng.normal(0, 1, size=length)
y[rng.integers(0, length, size=10)] += rng.normal(0, 4, size=10)
y = (y - y.mean()) / y.std()


def _noise_regression(line_fit, sigma, size=None):
    return line_fit + pm.Normal.dist(mu=0, sigma=sigma, size=size)


with pm.Model() as lr_model:
    # Define priors
    intercept = pm.Normal("intercept", 0, sigma=2)
    slope = pm.Normal("slope", 0, sigma=2)

    sigma = pm.HalfNormal("sigma", sigma=2)

    # Define likelihood
    pm.CustomDist(
        "obs",
        intercept + slope * x,
        sigma,
        dist=_noise_regression,
        observed=y,
    )

    # Learning rate scheduling
    optimiser = convergence_lr_scheduler(pm.adam(learning_rate=1e-0), 1, gamma=0.975)
    tracker = Tracker(
        lr=lambda: optimiser.__wrapped__.keywords["learning_rate"].get_value()
    )

    # Inference
    fit = pm.fit(
        obj_optimizer=optimiser,
        n=200,
        callbacks=[tracker],
        random_seed=1337,
    )

_, axs = plt.subplots(1, 2, figsize=(8, 4))
axs[0].plot(fit.hist)
axs[0].set_yscale("log")
axs[1].plot(tracker.hist["lr"], ".-")
axs[1].set_yscale("log")
plt.tight_layout()

which yields the following error:

ValueError: Input dimension mismatch: (input[0].shape[0] = 6, input[1].shape[0] = 19)
Apply node that caused the error: Composite{sqr((i0 - i1))}(prev_params, flat_params)
Toposort index: 3
Inputs types: [TensorType(float64, shape=(None,)), TensorType(float64, shape=(None,))]
Inputs shapes: [(6,), (19,)]
Inputs strides: [(8,), (8,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Sum{axes=None}(Composite{sqr((i0 - i1))}.0)]]

Right, finally figured this out (I think!).

The following example works fine:

from functools import wraps

import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import pytensor
import pytensor.tensor as pt
from matplotlib_inline.backend_inline import set_matplotlib_formats
from pymc.variational.callbacks import Tracker

set_matplotlib_formats("retina")


# %% LR scheduler
def flatten_shared(shared_list):
    return pt.concatenate([sh.flatten() for sh in shared_list])


def flatten_shared_np(shared_list):
    return np.concatenate([sh.get_value().flatten() for sh in shared_list])


def convergence_lr_scheduler(optimizer, tol=-1, gamma=0.5):
    # optimizer is a functools.partial, so we can update the keyword arguments in place
    # by mutating the .keywords dictionary
    kwargs = optimizer.keywords

    # Replace the provided learning_rate with a shared variable
    shared_lr = pytensor.shared(kwargs["learning_rate"], "learning_rate")

    # Set partial function keyword argument to the new shared variable
    kwargs["learning_rate"] = shared_lr

    @wraps(optimizer)
    def scheduled_optimizer(loss_or_grads, params, *args, **kwargs):
        # There's a conceptual thing to wrap your head around here: this code is all
        # called only ONCE, then the update mapping it generates is used over and over.
        # So we can make a new shared variable and it will be correctly updated at every
        # iteration.

        # The consequence is we can set up a shared variable that holds past parameter
        # values, then update prev -> new at every iteration. This lets us compute
        # whatever convergence criteria we want.

        # (We could also do the same thing with the loss function, if desired)
        # We want a flat vector of parameters for the convergence checks; initialize it
        # from the current parameter values.
        prev_params = pytensor.shared(flatten_shared_np(params), "prev_params")

        # Compute updates as normal, with the new values
        updates = optimizer(loss_or_grads, params, *args, **kwargs)

        # Grab the current time step to prevent a LR update on step 0
        update_values = list(updates.values())
        t = update_values[-1]

        # Grab the updated, new parameters. Make a (symbolic!) flat vector.
        new_params = flatten_shared([updates[param] for param in params])

        # Compute convergence criteria
        sq_change = ((prev_params - new_params) ** 2).sum()
        # sq_change = pytensor.printing.Print("sq_change")(sq_change)

        # Check for a plateau. If we find one, reduce the learning rate
        new_lr = pt.switch(
            pt.and_(pt.lt(sq_change, tol), pt.gt(t, 0)), shared_lr * gamma, shared_lr
        )

        # Add the learning rate update to the updates dictionary
        updates[shared_lr] = new_lr

        # # Update the prev_params to the new_params
        updates[prev_params] = new_params

        return updates

    # Return the wrapped optimizer partial function
    return scheduled_optimizer


# Fit a gamma linear regression model to some synthetic data
length = 720

rng = np.random.default_rng(1337)
x = np.linspace(1e-2, 1, num=length)
true_regression_line = 5 * x + 4
y = true_regression_line + rng.normal(0, 1, size=length)
y[rng.integers(0, length, size=10)] += rng.normal(0, 4, size=10)
y = (y - y.mean()) / y.std()


def _noise_regression(line_fit, alpha, beta, size=None):
    return line_fit + pm.Gamma.dist(alpha=alpha, beta=beta, size=size)


with pm.Model() as lr_model:
    # Define priors
    intercept = pm.Normal("intercept", 0, sigma=2)
    slope = pm.Normal("slope", 0, sigma=2)

    alpha = pm.HalfNormal("alpha", sigma=2)
    beta = pm.HalfNormal("beta", sigma=2)

    # Define likelihood
    pm.CustomDist(
        "obs",
        intercept + slope * x,
        alpha,
        beta,
        dist=_noise_regression,
        observed=y,
    )

    # Learning rate scheduling
    optimiser = convergence_lr_scheduler(
        pm.adam(learning_rate=1e-1), tol=1e-4, gamma=0.9995
    )
    tracker = Tracker(
        lr=lambda: optimiser.__wrapped__.keywords["learning_rate"].get_value()
    )

    # Inference
    fit = pm.fit(
        obj_optimizer=optimiser,
        n=20000,
        callbacks=[tracker],
        random_seed=1337,
    )

_, axs = plt.subplots(1, 2, figsize=(8, 4))
axs[0].plot(np.arange(len(fit.hist)), fit.hist)
axs[0].set_yscale("log")
axs[0].set_title("Fit history")
axs[1].plot(tracker.hist["lr"], ".-")
axs[1].set_yscale("log")
axs[1].set_title("LR")
plt.tight_layout()


There’s an extra detail that is quite interesting.

Keras’ ReduceLROnPlateau basically monitors the evaluation loss and reduces the LR if it has plateaued for a while. This is very useful: you can choose a high learning rate to begin with and, even if it’s way too large, the eval loss will stay high because of the large parameter swings, so the scheduler will eventually bring the rate down. On the other hand, if you choose a very small learning rate, things will still converge (albeit really slowly), so it won’t reduce the learning rate further.
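
For reference, this is roughly how it’s used in Keras (a sketch; model, X and y stand in for whatever you’re fitting):

from tensorflow import keras

# Halve the LR whenever the validation loss hasn't improved for 10 epochs
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=10)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[reduce_lr])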

Our current convergence_lr_scheduler above is different in that it checks the magnitude of the parameter updates. This can be useful, but it’s not nearly as useful as Keras’ method. For example, it won’t reduce the learning rate when it’s too large, because the parameter swings will also be large (although relatively blind). Conversely, if the learning rate is too small, training will converge but the updates will be slow, and the current scheduler will then reduce the learning rate even further because the parameter updates are small!

I wonder what the best PyMC equivalent to Keras’ evaluation loss is :thinking:

The result of pymc.fit() has a history (hist) that tracks convergence, though I’m not sure about the details. Also, the optimiser produces a score; is that equivalent as well?

You have the loss inside the function as well, in the variable loss_or_grads. You can just use that instead of params to compute the convergence criterion.


loss_or_grads seems to be a symbolic expression, actually a list of symbolic expressions. Is the actual loss value available from within the optimiser?

You don’t want the actual loss, you want the symbolic loss. Ultimately you want to add an entry into the updates dict, old_loss → new_loss, compute a convergence criterion f(new_loss, old_loss), and use that criterion to update the LR.

My only question is exactly what these values are. Are they loss per observation? Or is the first value the loss then the rest are gradients? Or something else? You can add a print([value.get_value() for value in loss_or_grads]) to peek inside and try to discern what you’re getting.

I have the same question :slight_smile:

loss_or_grads is a list of length two, and both values are tensors. I also tried to print their values to see what’s going on, but I get: AttributeError: 'TensorVariable' object has no attribute 'get_value' when I try to grab their values.

Ah in this case replace them with print Ops. Something like:

    def scheduled_optimizer(loss_or_grads, params, *args, **kwargs):
        loss_or_grads = [pytensor.printing.Print(f'loss_item_{i}')(item) for i, item in enumerate(loss_or_grads)]

Then pass this replaced thing to the optimizer. It should spam output about the values inside it when you run the model.

Also, since the loss is not a shared variable, you can just initialize the old_loss shared variable to be np.inf, so that the first update is always loss-reducing. The update dict always maps shared → symbolic, so that’s not an issue.
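
To make that bookkeeping concrete outside of PyMC’s internals, here’s a tiny self-contained pytensor sketch of the old_loss → new_loss idea: plain gradient descent on x², halving the learning rate whenever the loss stops improving by more than tol.

import numpy as np
import pytensor
import pytensor.tensor as pt

tol, gamma = 1e-2, 0.5

x = pytensor.shared(5.0, "x")
lr = pytensor.shared(0.1, "lr")
old_loss = pytensor.shared(np.inf, "old_loss")  # +inf, so the first step always counts as an improvement

loss = x ** 2
grad = pytensor.grad(loss, x)

# All right-hand sides are evaluated with the old values, so old_loss ends up holding the
# previous step's loss -- exactly the shared -> symbolic mapping described above
improved = pt.lt(loss, old_loss - tol)
updates = {
    x: x - lr * grad,
    lr: pt.switch(improved, lr, lr * gamma),  # reduce the LR on a plateau
    old_loss: loss,
}

f = pytensor.function([], [loss, lr], updates=updates)
for _ in range(30):
    print(f())  # the LR starts halving once the loss flattens out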