So what you want is to have the learning_rate parameter be a pytensor.shared variable. This is the class of variables that are allowed to change dynamically after a function is compiled. The wrinkle is that you need to provide pytensor with an old_value: new_value mapping that tells it how to update these shared variables. This mapping is called updates, and it's created by the optimizer itself.
This is why I think the best way to go is to write a wrapper that intercepts the optimizer, replaces the learning rate with a shared variable, and then injects a new update describing how to change the learning rate as a function of time. Here's a really rough draft of a multiplicative step scheduler:
import pytensor
import pytensor.tensor as pt
from functools import wraps

def step_lr_scheduler(optimizer, update_every, gamma=0.1):
    # optimizer is a functools.partial, so we can update its keyword arguments
    # in place by mutating the .keywords dictionary
    kwargs = optimizer.keywords

    # Replace the provided learning_rate with a shared variable
    shared_lr = pytensor.shared(kwargs['learning_rate'], 'learning_rate')

    # Point the partial function's keyword argument at the new shared variable
    kwargs['learning_rate'] = shared_lr

    @wraps(optimizer)
    def scheduled_optimizer(*args, **inner_kwargs):
        # Get the updates dictionary from the optimizer
        updates = optimizer(*args, **inner_kwargs)

        # The last update for all the optimizers is the timestep (is this always true?);
        # we need to use that shared time variable for our lr update, so everything stays in sync
        t = updates[list(updates.keys())[-1]]

        # Here's the actual learning rate update: multiply by gamma every update_every steps
        new_lr = pt.switch(pt.eq(pt.mod(t, update_every), 0),
                           shared_lr * gamma,
                           shared_lr)

        # Add the learning rate update to the updates dictionary
        updates[shared_lr] = new_lr
        return updates

    # Return the wrapped optimizer function
    return scheduled_optimizer
Here’s how it would work:
import numpy as np
import pymc as pm

# Generate some fake regression data
true_params = np.random.normal(size=(4,))
X_data = np.random.normal(size=(100, 4))
targets = X_data @ true_params

# Do linear regression via SGD
X_data_pt = pt.as_tensor_variable(X_data)
targets_pt = pt.as_tensor_variable(targets)

# Make parameters to be optimized by the scheduled adam optimizer
params = [pytensor.shared(np.zeros(4,), name='params')]

# Compute the total squared error as the loss function
loss = ((X_data_pt @ params[0] - targets_pt) ** 2).sum()

# Make the scheduled optimizer
optimizer = step_lr_scheduler(pm.adam(learning_rate=1.0), 10, gamma=0.99)
optimizer
>>> Out: <function __main__.step_lr_scheduler.<locals>.scheduled_optimizer(loss_or_grads=None, params=None, *, learning_rate=learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-08)>
So as before, we get back a partial-like function. As you can see, the default for the learning_rate kwarg is now learning_rate, which is the name of our shared variable. We can compile a dummy function and run it a bunch of times to make sure the learning rate updates as expected:
# Create the update dict for everything in the model
updates = optimizer(loss, params)

# Specifically grab the learning rate variable, because we want to track it
# and make sure it updates as expected
new_lr = list(updates.values())[-1]

# Compile the SGD function. We don't need any inputs (everything is shared), but we want to see
# the loss, learning rate, and parameter values at every step.
# Passing updates is a must, because that's how the shared variables get updated
f = pytensor.function([], [loss, new_lr] + params, updates=updates)

# Execute the function a bunch of times (simulates a training loop)
for _ in range(100):
    print(f())
This gives us:
[array(864.79879456), array(1.), array([0., 0., 0., 0.])]
[array(92.89969091), array(1.), array([-1., -1., 1., -1.])]
[array(110.41362865), array(1.), array([-1.89211474, -1.52339263, 1.85110198, -1.83649572])]
[array(311.38256283), array(1.), array([-2.48227539, -1.4262917 , 2.29689346, -2.24316702])]
[array(321.97346962), array(1.), array([-2.68081503, -1.05816636, 2.31979847, -2.23195632])]
[array(199.61391494), array(1.), array([-2.58909828, -0.61716199, 2.08470927, -1.9761059 ])]
[array(87.54677018), array(1.), array([-2.32732722, -0.26504298, 1.72133361, -1.60492838])]
[array(47.85639667), array(1.), array([-1.9874953 , -0.12268673, 1.32737604, -1.21745443])]
[array(71.28263936), array(1.), array([-1.64594209, -0.19705334, 0.9908905 , -0.90213686])]
[array(112.27732672), array(0.99), array([-1.36816543, -0.42513222, 0.78195331, -0.72348732])]
So it seems to be working. You can see the learning rate update to 0.99 on the 10th step. You should be able to use scheduled_optimizer as a drop-in replacement for pm.adam. If you run this yourself, you'll also see the parameters converge to true_params, which is what we want.
Obviously this has a lot of room for improvement, but these are the broad strokes I’m thinking about.