
Keras learning rate decay in PyTorch

I have a question concerning learning rate decay in Keras. I need to understand how the decay option works inside its optimizers in order to translate it into an equivalent PyTorch formulation.

From the Keras SGD source code, I see that the learning rate is updated as follows after every batch:

lr = self.lr * (1. / (1. + self.decay * self.iterations))

Does this mean that after every batch update the lr is updated starting from its value at the previous update, or from its initial value? In other words, which of the following two interpretations is correct?

lr = lr_0 * (1. / (1. + self.decay * self.iterations))

or

lr = lr * (1. / (1. + self.decay * self.iterations))

where lr is the learning rate after the previous update and lr_0 is always the initial learning rate.

If the correct answer is the first one, this would mean that, in my case, the learning rate would decay from 0.001 to just 0.0002 after 100 epochs, whereas in the second case it would decay from 0.001 to around 1e-230 after only 70 epochs.
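
For illustration, here is a quick numerical check of the two interpretations (the decay value and number of batches per epoch below are hypothetical, just to make the comparison concrete):

lr_0 = 0.001
decay = 1e-4            # hypothetical decay value
steps_per_epoch = 400   # hypothetical number of batches per epoch
total_steps = 100 * steps_per_epoch

lr_fixed = lr_0      # interpretation 1: always computed from lr_0
lr_recurrent = lr_0  # interpretation 2: multiplied by the factor at every step

for step in range(1, total_steps + 1):
    factor = 1. / (1. + decay * step)
    lr_fixed = lr_0 * factor
    lr_recurrent = lr_recurrent * factor

print(lr_fixed)      # ~2e-4, stays in the same order of magnitude as lr_0
print(lr_recurrent)  # underflows to (numerically) zero long before 100 epochs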

Just to give some context, I'm working on a CNN for a regression problem on images, and I have to translate existing Keras code into PyTorch code. So far, with the second of the aforementioned interpretations, the network always predicts the same value, regardless of batch size and input at test time.

Thanks in advance for your help!

Based on the Keras implementation, I think your first formulation is the correct one, the one that contains the initial learning rate (note that self.lr is not being updated).

However, I think your calculation is probably not correct: since the denominator is the same and lr_0 >= lr (because you are decaying), the first formulation has to result in a larger learning rate.

I'm not sure if this exact decay schedule is available in PyTorch, but you can easily create something similar with torch.optim.lr_scheduler.LambdaLR.

from torch.optim.lr_scheduler import LambdaLR

decay = .001
# LambdaLR scales the initial lr by fcn(step): lr = lr_0 / (1 + decay * step)
fcn = lambda step: 1./(1. + decay*step)
scheduler = LambdaLR(optimizer, lr_lambda=fcn)

Finally, don't forget that you will need to call .step() explicitly on the scheduler; it's not enough to step your optimizer. Also, learning rate scheduling is most often done only after a full epoch, not after every single batch, but I see that here you are just recreating the Keras behavior.
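
For example, a minimal training-loop sketch that steps the scheduler once per batch, mimicking the Keras per-iteration decay, could look like this (the toy model and random data are placeholders just to make it self-contained):

import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

# Toy model and data, only to make the sketch runnable; replace with your own CNN and loader.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
dataloader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(5)]

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
decay = 0.001
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: 1. / (1. + decay * step))

for epoch in range(3):
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()  # step the scheduler after every batch, like Keras does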

Actually, the response of mkisantal might be incorrect, since the actual equation for the learning rate in Keras (at least it used to be; there is no default decay option anymore) was the following:

lr = lr * (1. / (1. + self.decay * self.iterations))

(see https://github.com/keras-team/keras/blob/2.2.0/keras/optimizers.py#L178)

And the solution presented by mkisantal is missing the recurrent/multiplicative term lr; therefore, the more accurate version should be based on MultiplicativeLR:

from torch.optim.lr_scheduler import MultiplicativeLR

decay = .001
# MultiplicativeLR multiplies the current lr by fcn(step) at every scheduler.step()
fcn = lambda step: 1./(1. + decay*step)
scheduler = MultiplicativeLR(optimizer, lr_lambda=fcn)
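
As a rough sanity check of how differently the two schedulers behave (this sketch uses a throwaway linear layer only to build an optimizer): LambdaLR rescales the initial learning rate by fcn(step), while MultiplicativeLR multiplies the current learning rate by fcn(step) at every call, so the two schedules diverge quickly.

import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR, MultiplicativeLR

decay = .001
fcn = lambda step: 1. / (1. + decay * step)

def lr_after(scheduler_cls, n_steps):
    # Throwaway optimizer, used only to observe the learning rate schedule.
    opt = torch.optim.SGD(nn.Linear(1, 1).parameters(), lr=0.001)
    sched = scheduler_cls(opt, lr_lambda=fcn)
    for _ in range(n_steps):
        opt.step()
        sched.step()
    return opt.param_groups[0]['lr']

print(lr_after(LambdaLR, 1000))          # 0.001 / (1 + decay * 1000) = 0.0005
print(lr_after(MultiplicativeLR, 1000))  # 0.001 times the product of all factors, far smaller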
