Gradient Accumulation with Custom model.fit in TF.Keras?
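Please add a brief comment with your thoughts so that I can improve my question. Thank you. :)
I am trying to train a tf.keras model with gradient accumulation (GA). However, I don't want to do it in a custom training loop; instead, I want to customize the .fit() method by overriding train_step. Is that possible, and how can it be done? The reason is that if we want to benefit from Keras built-in functionality such as fit and callbacks, we don't want to write a custom training loop; but if we need to override train_step for some reason (GA or anything else), we can customize the fit method and still use those built-in features.
Also, I know the advantages of GA, but what are its main drawbacks? Why is it an optional feature rather than the framework's default?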
import tensorflow as tf

# overriding train step
# my attempt: it is not implemented correctly and needs fixing
class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = n_gradients
        self.gradient_accumulation = [tf.zeros_like(this_var)
                                      for this_var in self.trainable_variables]

    def train_step(self, data):
        x, y = data
        batch_size = tf.cast(tf.shape(x)[0], tf.float32)
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        accum_gradient = [(acum_grad + grad) for acum_grad, grad in
                          zip(self.gradient_accumulation, gradients)]
        accum_gradient = [this_grad / batch_size for this_grad in accum_gradient]
        # apply accumulated gradients
        self.optimizer.apply_gradients(zip(accum_gradient, self.trainable_variables))
        # TODO: reset self.gradient_accumulation
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
Please run and check it with the following toy setup.
# Model
size = 32
input = tf.keras.Input(shape=(size, size, 3))
efnet = tf.keras.applications.DenseNet121(weights=None,
                                          include_top=False,
                                          input_tensor=input)
base_maps = tf.keras.layers.GlobalAveragePooling2D()(efnet.output)
base_maps = tf.keras.layers.Dense(units=10, activation='softmax',
                                  name='primary')(base_maps)
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])

# bind all
custom_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=['accuracy'],
    optimizer=tf.keras.optimizers.Adam())

# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.expand_dims(x_train, -1)
x_train = tf.repeat(x_train, 3, axis=-1)
x_train = tf.divide(x_train, 255)
x_train = tf.image.resize(x_train, [size, size])  # if we want to resize
y_train = tf.one_hot(y_train, depth=10)

# customized fit
custom_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose=1)
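I found that some other people have also tried to achieve this and ran into the same problem. One of them found a workaround, but it is too messy, and I think there should be a better approach.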
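Yes, it is possible to customize the .fit() method by overriding train_step, without a custom training loop. The following simple example shows how to train a simple MNIST classifier with gradient accumulation: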
import tensorflow as tf

# overriding train step to accumulate gradients over several batches
class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # number of batches to accumulate before each weight update
        self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
        # counter of batches accumulated since the last update
        self.n_acum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
        # one non-trainable buffer per trainable variable
        self.gradient_accumulation = [
            tf.Variable(tf.zeros_like(v, dtype=tf.float32), trainable=False)
            for v in self.trainable_variables
        ]

    def train_step(self, data):
        self.n_acum_step.assign_add(1)
        x, y = data
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign_add(gradients[i])
        # If n_acum_step reaches n_gradients, apply the accumulated gradients
        # to update the variables; otherwise do nothing
        tf.cond(tf.equal(self.n_acum_step, self.n_gradients),
                self.apply_accu_gradients, lambda: None)
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def apply_accu_gradients(self):
        # apply accumulated gradients (summed over the steps, not averaged)
        self.optimizer.apply_gradients(
            zip(self.gradient_accumulation, self.trainable_variables))
        # reset the step counter and the accumulation buffers
        self.n_acum_step.assign(0)
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign(
                tf.zeros_like(self.trainable_variables[i], dtype=tf.float32))

# Model
input = tf.keras.Input(shape=(28, 28))
base_maps = tf.keras.layers.Flatten(input_shape=(28, 28))(input)
base_maps = tf.keras.layers.Dense(128, activation='relu')(base_maps)
base_maps = tf.keras.layers.Dense(units=10, activation='softmax', name='primary')(base_maps)
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])

# bind all
custom_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=['accuracy'],
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))

# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.divide(x_train, 255)
y_train = tf.one_hot(y_train, depth=10)

# customized fit
custom_model.fit(x_train, y_train, batch_size=6, epochs=3, verbose=1)
Output:
Epoch 1/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.5053 - accuracy: 0.8584
Epoch 2/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.1389 - accuracy: 0.9600
Epoch 3/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.0898 - accuracy: 0.9748
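One detail worth noting about the code above: it sums the gradients over the accumulation steps and applies the sum, without dividing by the number of steps. Adam is largely insensitive to the overall gradient scale, but with an optimizer such as plain SGD you would probably want the mean instead. The accumulate, apply, and reset control flow with that averaging added can be sketched in plain Python as follows. This is only an illustration, not part of the answer's code: `GradientAccumulator` and its methods are hypothetical names, and plain lists of floats stand in for tensors.

```python
class GradientAccumulator:
    """Hypothetical helper: sums gradients and releases their mean every n_steps calls."""

    def __init__(self, n_steps, n_params):
        self.n_steps = n_steps          # mini-batches per weight update
        self.step = 0                   # mini-batches seen since the last update
        self.total = [0.0] * n_params   # running sum of gradients

    def add(self, grads):
        """Add one mini-batch's gradients.

        Returns the averaged gradients on every n_steps-th call, else None.
        """
        self.step += 1
        self.total = [t + g for t, g in zip(self.total, grads)]
        if self.step < self.n_steps:
            return None
        mean = [t / self.n_steps for t in self.total]
        # reset the counter and the buffers for the next accumulation cycle
        self.step = 0
        self.total = [0.0] * len(self.total)
        return mean


acc = GradientAccumulator(n_steps=3, n_params=2)
batches = ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0])
results = [acc.add(g) for g in batches]
print(results)  # [None, None, [3.0, 4.0], None]
```

In the Keras version, the equivalent change would be to scale the accumulated gradients by 1 / n_gradients before passing them to the optimizer.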
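Gradient accumulation is a mechanism that splits the batch of samples used to train a neural network into several mini-batches that are run sequentially.
GA computes the loss and gradients after each mini-batch, but instead of updating the model parameters right away, it waits and accumulates the gradients over consecutive mini-batches. This lets it overcome memory constraints: the model is trained with less memory while behaving as if a large batch size were used.
Example: if you run gradient accumulation with 5 steps and a batch size of 4 images, it serves almost the same purpose as running with a batch size of 20 images.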
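The 5 x 4 versus 20 example can be checked numerically. The sketch below assumes a toy one-parameter linear model with a mean-squared-error loss and made-up data (none of this comes from the original post) and compares the full-batch gradient with the mean of five mini-batch gradients. The two agree here because the loss is a per-sample mean and the mini-batches have equal size. The "almost" in the statement still matters in practice: layers whose behaviour depends on the batch, such as batch normalization, only ever see 4 samples at a time.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# 20 toy samples, made up for illustration
xs = [0.1 * i for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# gradient from one batch of 20 samples
full = mse_grad(w, xs, ys)

# gradient accumulated over 5 mini-batches of 4 samples, then averaged
steps, batch = 5, 4
accum = 0.0
for k in range(steps):
    lo = k * batch
    accum += mse_grad(w, xs[lo:lo + batch], ys[lo:lo + batch])
accum /= steps

matches = math.isclose(full, accum, rel_tol=1e-9)
print(full, accum, matches)
```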
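We can also parallelize training when using GA, that is, aggregate the gradients from multiple machines.
This technique works well and is therefore widely used. There are only a few things to consider before using it, and I don't think they should be called drawbacks; after all, all GA does is turn 4 + 4 into 2 + 2 + 2 + 2.
If your machine has enough memory for a batch size that is already large enough, there is no need to use GA: it is well known that too large a batch size leads to poor generalization, and GA will certainly run slower than a plain batch of the same size that your machine's memory can already handle.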