Gradient Accumulation with Custom model.fit in TF.Keras?
Please add a brief comment with your thoughts so that I can improve my question. Thanks. :)
I'm trying to train a tf.keras model with Gradient Accumulation (GA). However, I don't want to use it in a custom training loop (like this) but rather customize the .fit() method by overriding train_step. Is that possible, and how can it be done? The reason is that we want to keep the benefits of built-in Keras functionality such as fit and callbacks, so we don't want to write a custom training loop. At the same time, when we need to override train_step for some reason (GA or otherwise), we can customize the fit method and still use those built-in functions.
Also, I know the pros of using GA, but what are its major cons? Why does it come as an optional feature rather than as a default in the framework?
# overriding train step
# my attempt
# it's not appropriately implemented
# and need to fix
class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = n_gradients
        self.gradient_accumulation = [tf.zeros_like(this_var) for this_var in \
                                      self.trainable_variables]

    def train_step(self, data):
        x, y = data
        batch_size = tf.cast(tf.shape(x)[0], tf.float32)
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        accum_gradient = [(acum_grad+grad) for acum_grad, grad in \
                          zip(self.gradient_accumulation, gradients)]
        accum_gradient = [this_grad/batch_size for this_grad in accum_gradient]
        # apply accumulated gradients
        self.optimizer.apply_gradients(zip(accum_gradient, self.trainable_variables))
        # TODO: reset self.gradient_accumulation
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
Please run and check it with the following toy setup.
# Model
size = 32
input = tf.keras.Input(shape=(size,size,3))
efnet = tf.keras.applications.DenseNet121(weights=None,
                                          include_top = False,
                                          input_tensor = input)
base_maps = tf.keras.layers.GlobalAveragePooling2D()(efnet.output)
base_maps = tf.keras.layers.Dense(units=10, activation='softmax',
                                  name='primary')(base_maps)
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])
# bind all
custom_model.compile(
    loss = tf.keras.losses.CategoricalCrossentropy(),
    metrics = ['accuracy'],
    optimizer = tf.keras.optimizers.Adam() )
# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.expand_dims(x_train, -1)
x_train = tf.repeat(x_train, 3, axis=-1)
x_train = tf.divide(x_train, 255)
x_train = tf.image.resize(x_train, [size,size]) # if we want to resize
y_train = tf.one_hot(y_train , depth=10)
# customized fit
custom_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose = 1)
I've found that some others have also tried to achieve this and ended up with the same issue. One of them found a workaround, here, but it's too messy and I think there should be a better approach.
Yes, it is possible to customize the .fit() method by overriding train_step without a custom training loop. The following simple example shows how to train a simple MNIST classifier with gradient accumulation:
import tensorflow as tf

# overriding train step to accumulate gradients over n_gradients mini-batches
class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
        # counts the mini-batches seen since the last weight update
        self.n_acum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
        # one non-trainable buffer per trainable variable
        self.gradient_accumulation = [tf.Variable(tf.zeros_like(v, dtype=tf.float32), trainable=False)
                                      for v in self.trainable_variables]

    def train_step(self, data):
        self.n_acum_step.assign_add(1)
        x, y = data
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients (summed, not averaged)
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign_add(gradients[i])
        # If n_acum_step reaches n_gradients, apply the accumulated gradients
        # to update the variables; otherwise do nothing
        tf.cond(tf.equal(self.n_acum_step, self.n_gradients), self.apply_accu_gradients, lambda: None)
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def apply_accu_gradients(self):
        # apply accumulated gradients
        self.optimizer.apply_gradients(zip(self.gradient_accumulation, self.trainable_variables))
        # reset the step counter and the accumulation buffers
        self.n_acum_step.assign(0)
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign(tf.zeros_like(self.trainable_variables[i], dtype=tf.float32))

# Model
input = tf.keras.Input(shape=(28, 28))
base_maps = tf.keras.layers.Flatten(input_shape=(28, 28))(input)
base_maps = tf.keras.layers.Dense(128, activation='relu')(base_maps)
base_maps = tf.keras.layers.Dense(units=10, activation='softmax', name='primary')(base_maps)
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])
# bind all
custom_model.compile(
    loss = tf.keras.losses.CategoricalCrossentropy(),
    metrics = ['accuracy'],
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3) )
# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.divide(x_train, 255)
y_train = tf.one_hot(y_train , depth=10)
# customized fit
custom_model.fit(x_train, y_train, batch_size=6, epochs=3, verbose = 1)
Outputs:
Epoch 1/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.5053 - accuracy: 0.8584
Epoch 2/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.1389 - accuracy: 0.9600
Epoch 3/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.0898 - accuracy: 0.9748
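The accumulate / apply / reset control flow of train_step above can also be traced without TensorFlow. The following is only a minimal pure-Python sketch: the class GradientAccumulator, its attributes, and the plain SGD update are hypothetical stand-ins for the Keras model and optimizer, and, unlike the answer's code (which sums the gradients), this sketch assumes the accumulated gradients are averaged before the update.

```python
# Pure-Python sketch of the accumulate / apply / reset logic (no TensorFlow).
# GradientAccumulator and all of its names are hypothetical, for illustration.

class GradientAccumulator:
    def __init__(self, n_gradients, n_params, learning_rate=0.1):
        self.n_gradients = n_gradients          # mini-batches per weight update
        self.learning_rate = learning_rate      # plain SGD step size (assumption)
        self.params = [0.0] * n_params          # stand-in for trainable variables
        self.accumulated = [0.0] * n_params     # one buffer per parameter
        self.n_acum_step = 0                    # mini-batches since last update
        self.n_updates = 0                      # number of weight updates so far

    def step(self, gradients):
        """Add one mini-batch gradient; update params on every n-th call."""
        self.n_acum_step += 1
        for i, g in enumerate(gradients):
            self.accumulated[i] += g
        if self.n_acum_step == self.n_gradients:
            self._apply()

    def _apply(self):
        # Assumption: average the accumulated gradients before the update.
        for i in range(len(self.params)):
            mean_grad = self.accumulated[i] / self.n_gradients
            self.params[i] -= self.learning_rate * mean_grad
            self.accumulated[i] = 0.0           # reset the buffer
        self.n_acum_step = 0                    # reset the counter
        self.n_updates += 1


acc = GradientAccumulator(n_gradients=10, n_params=2)
for _ in range(25):                 # 25 mini-batches with a constant gradient
    acc.step([1.0, -2.0])

print(acc.n_updates)     # 2  (updates after mini-batch 10 and 20)
print(acc.n_acum_step)   # 5  (five mini-batches still waiting in the buffer)
```

Applied to the run above: 60000 MNIST training samples with batch_size=6 give the 10000 steps per epoch shown in the log, so n_gradients=10 means 1000 weight updates per epoch with an effective batch size of 60. The sketch also shows that gradients left in the buffer when the step count is not a multiple of n_gradients are carried over rather than applied.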
Gradient accumulation is a mechanism that splits the batch of samples used for training a neural network into several mini-batches that are run sequentially.
GA calculates the loss and gradients after each mini-batch, but instead of updating the model parameters right away, it waits and accumulates the gradients over consecutive mini-batches. This overcomes memory constraints: the model can be trained as if with a large batch size while only needing the memory for a small one.
Example: If you run gradient accumulation with 5 steps and a batch size of 4 images, it serves almost the same purpose as running with a batch size of 20 images.
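The 5 x 4 versus 20 example can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original answer: it uses a made-up one-dimensional linear model with a mean-squared-error loss, equal-sized mini-batches, and averaging (not summing) of the accumulated gradients. Under those assumptions the accumulated gradient matches the full-batch gradient up to floating-point rounding.

```python
# Pure-Python check: averaging the gradients of 5 mini-batches of 4 samples
# reproduces the gradient of one batch of 20 samples.
# The linear model, the data, and all names are made up for illustration.
import random


def mse_gradient(w, b, xs, ys):
    """Gradient of mean((w*x + b - y)**2) with respect to (w, b)."""
    n = len(xs)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(20)]
ys = [3.0 * x + 0.5 + random.gauss(0.0, 0.1) for x in xs]
w, b = 0.0, 0.0

# One large batch of 20 samples.
full_grad = mse_gradient(w, b, xs, ys)

# Five mini-batches of 4 samples, accumulated without updating w or b.
accum_steps, batch_size = 5, 4
acc_w, acc_b = 0.0, 0.0
for k in range(accum_steps):
    lo, hi = k * batch_size, (k + 1) * batch_size
    gw, gb = mse_gradient(w, b, xs[lo:hi], ys[lo:hi])
    acc_w += gw
    acc_b += gb
accum_grad = (acc_w / accum_steps, acc_b / accum_steps)

print(full_grad)
print(accum_grad)   # agrees with full_grad up to rounding error
```

The "almost" in the example still matters in practice: layers such as BatchNormalization compute their statistics per mini-batch, so a model containing them does not behave exactly as it would with the large batch. If the accumulated gradients are summed instead of averaged, the result is accum_steps times larger, which changes the effective step size for plain SGD.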
We can also parallelize training when using GA, i.e. aggregate gradients from multiple machines.
This technique works so well that it is widely used. There are a few things to consider before using it, though I don't think they should be called cons; after all, all GA does is turn 4 + 4 into 2 + 2 + 2 + 2.
If your machine has enough memory for a batch size that is already large enough, there is no need to use GA: it is well known that too large a batch size leads to poor generalization, and training will certainly run slower if you use GA to reach a batch size that your machine's memory can already handle.
Reference:
What is Gradient Accumulation in Deep Learning?
Thanks to @Mr.For Example for his convenient answer.
I have also observed that using Gradient Accumulation usually won't speed up training, since we still run n_gradients forward passes and compute all the gradients before each weight update. In my experience, however, it does speed up the convergence of the model. I also found that the mixed_precision technique can be really helpful here. Details here.
# TF >= 2.4; earlier releases used tf.keras.mixed_precision.experimental.set_policy
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)