While experimenting with model.train_on_batch and custom training loops, I realized that in the custom training loop the loss and gradients do not take any l1/l2 regularizers into account, so optimizer.apply_gradients ignores them as well. The code below demonstrates this, but the idea is simple. My question is whether there is a way to use any of these optimizers, independent of the optimizer's internals, so that the regularizers are taken into account. How is this implemented in Keras? On a related note, model.train_on_batch returns a value that is not the loss (as claimed in the docstring) but something else. Does anyone know what it returns?
Code
To see this effect first create some data
import numpy as np
import tensorflow as tf

x = tf.constant([[1]])
y = tf.constant([[1]])
and create a function to make a reproducible model
def make_model(l1=.01, l2=.01):
    # Fix the seeds so that every call builds a model with the same initial weights.
    tf.random.set_seed(42)
    np.random.seed(42)
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(2, 'softmax',
                              use_bias=False,
                              kernel_regularizer=tf.keras.regularizers.l1_l2(l1=l1, l2=l2),
                              input_shape=(1,))
    ])
    return model
Now run Keras train_on_batch
model=make_model()
loss_object=tf.keras.losses.SparseCategoricalCrossentropy()
optimizer=tf.keras.optimizers.RMSprop()
model.compile(loss=loss_object,optimizer=optimizer)
model.train_on_batch(x,y)
and compare the outputs with the custom training loop as explained in the above link as well as here
model=make_model()
loss_object=tf.keras.losses.SparseCategoricalCrossentropy()
optimizer=tf.keras.optimizers.RMSprop()
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = model(x)
        loss = loss_object(y, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

train_step(x, y).numpy()
You will see the two results are different unless l1==0 and l2==0.
Actually, I found the answer in Aurélien Géron's book.
In fact, after I implemented the code below, I found that this is also covered in the TensorFlow guide on custom training (I don't know why it's not in the tutorials mentioned in the question, since it's an important point). The solution there is more general than the one given here, but I am keeping this one because it sheds a bit more light on what's happening.
So it is as simple as modifying the custom training loop to
def add_model_regularizer_loss(model):
    loss = 0
    for l in model.layers:
        if hasattr(l, 'layers') and l.layers:  # the layer itself is a model
            loss += add_model_regularizer_loss(l)  # recurse into the nested model
        if hasattr(l, 'kernel_regularizer') and l.kernel_regularizer:
            loss += l.kernel_regularizer(l.kernel)
        if hasattr(l, 'bias_regularizer') and l.bias_regularizer:
            loss += l.bias_regularizer(l.bias)
    return loss

def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = model(x)
        loss = loss_object(y, predictions)
        # Add the penalty inside the tape so that it contributes to the gradients.
        loss += add_model_regularizer_loss(model)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
To answer the second part of my question: it is this loss value, including the regularization terms, that Keras's model.train_on_batch returns.
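To make the extra term concrete, here is a minimal sketch in plain Python of what an l1_l2 kernel regularizer contributes, assuming the usual Keras definition l1 * sum(|w|) + l2 * sum(w^2). The function names and the kernel values are made up for illustration and are not part of any library.

```python
def l1_l2_penalty(weights, l1=0.01, l2=0.01):
    """Penalty added to the loss: l1 * sum(|w|) + l2 * sum(w^2)."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)


def l1_l2_penalty_grad(weights, l1=0.01, l2=0.01):
    """Gradient of the penalty w.r.t. each weight, taking sign(0) = 0."""
    def sign(w):
        return (w > 0) - (w < 0)
    return [l1 * sign(w) + 2.0 * l2 * w for w in weights]


# Hypothetical values standing in for the flattened Dense kernel.
kernel = [0.5, -1.0]

print("penalty:", l1_l2_penalty(kernel))
print("penalty gradient:", l1_l2_penalty_grad(kernel))
print("penalty with l1 = l2 = 0:", l1_l2_penalty(kernel, l1=0.0, l2=0.0))
```

Both the penalty and its gradient vanish when l1 and l2 are zero, which is consistent with the two training paths in the question agreeing only in that case.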
The recommended practice, as stated on the TF site, is to use model.losses. For example:
def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = model(x)
        loss = loss_object(y, predictions)
        loss += tf.add_n(model.losses)  # <--- SEE HERE
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
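As a sanity check, the model.losses approach can be compared directly against train_on_batch. The following is only a sketch under a few assumptions that differ from the question: the model is declared with tf.keras.Input instead of input_shape, the input is a float tensor, the initial weights are copied between the two models rather than relying on the random seed, and model.losses is checked for emptiness first because tf.add_n rejects an empty list.

```python
import tensorflow as tf


def make_model(l1=0.01, l2=0.01):
    # Same architecture as in the question; tf.keras.Input replaces input_shape.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(1,)),
        tf.keras.layers.Dense(
            2, activation="softmax", use_bias=False,
            kernel_regularizer=tf.keras.regularizers.l1_l2(l1=l1, l2=l2)),
    ])


x = tf.constant([[1.0]])  # float input (assumption; the question uses an int)
y = tf.constant([[1]])
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()


def train_step(model, optimizer, x, y):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        data_loss = loss_object(y, predictions)
        loss = data_loss
        if model.losses:  # guard: tf.add_n raises on an empty list
            loss = loss + tf.add_n(model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return float(data_loss), float(loss)


# Two models with identical starting weights, copied before any training.
keras_model = make_model()
custom_model = make_model()
custom_model.set_weights(keras_model.get_weights())

keras_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                    optimizer=tf.keras.optimizers.RMSprop())
keras_loss = float(keras_model.train_on_batch(x, y))

data_loss, custom_loss = train_step(
    custom_model, tf.keras.optimizers.RMSprop(), x, y)

print("train_on_batch loss:       ", keras_loss)
print("custom loss + model.losses:", custom_loss)
print("custom loss, data term only:", data_loss)
```

If the assumptions hold, the first two printed values should agree up to floating-point noise, while the data-only term should be smaller by the size of the regularization penalty.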