Gradient Accumulation with Custom model.fit in TF.Keras?

Question

Please add at least a brief comment explaining your thinking so that I can improve my question. Thanks. :)

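I am trying to train a tf.keras model with gradient accumulation (GA). However, I don't want to use it in a custom training loop; instead I want to customize the .fit() method by overriding train_step. Is that possible, and how can it be done? The reason is that a custom training loop gives up Keras built-in functionality such as fit and callbacks. If we need to change train_step for some reason (GA or otherwise), we can customize what fit does and still use those built-in features.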

Also, I know the advantages of using GA, but what are its main drawbacks? Why is it an optional feature rather than the framework's default?

# overriding train step 
# my attempt 
# it's not appropriately implemented 
# and need to fix 
class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = n_gradients
        self.gradient_accumulation = [tf.zeros_like(this_var) for this_var in \
                                           self.trainable_variables]

    def train_step(self, data):
        x, y = data
        batch_size = tf.cast(tf.shape(x)[0], tf.float32)  
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        accum_gradient = [(acum_grad+grad) for acum_grad, grad in \
               zip(self.gradient_accumulation, gradients)]
        accum_gradient = [this_grad/batch_size for this_grad in accum_gradient]
        # apply accumulated gradients
        self.optimizer.apply_gradients(zip(accum_gradient, self.trainable_variables))
        # TODO: reset self.gradient_accumulation 
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

Please run and check with the following toy setup.

# Model 
size = 32
input = tf.keras.Input(shape=(size,size,3))
efnet = tf.keras.applications.DenseNet121(weights=None,
                                          include_top = False, 
                                          input_tensor = input)
base_maps = tf.keras.layers.GlobalAveragePooling2D()(efnet.output) 
base_maps = tf.keras.layers.Dense(units=10, activation='softmax', 
                                             name='primary')(base_maps) 
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])

# bind all
custom_model.compile(
    loss = tf.keras.losses.CategoricalCrossentropy(),
    metrics = ['accuracy'],
    optimizer = tf.keras.optimizers.Adam() )

# data 
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.expand_dims(x_train, -1)
x_train = tf.repeat(x_train, 3, axis=-1)
x_train = tf.divide(x_train, 255)
x_train = tf.image.resize(x_train, [size,size]) # if we want to resize 
y_train = tf.one_hot(y_train , depth=10) 

# customized fit 
custom_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose = 1)

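Update

I found that some other people have also tried to achieve this and ended up with the same issue. One of them has a workaround, but it is too messy, and I think there should be a better approach.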

Answer 1

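Yes, it is possible to customize the .fit() method by overriding train_step, without a custom training loop. The simple example below shows how to train a simple MNIST classifier with gradient accumulation: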

import tensorflow as tf

# Override train_step so that fit() accumulates gradients
# over n_gradients batches before each optimizer update
class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
        self.n_acum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
        self.gradient_accumulation = [tf.Variable(tf.zeros_like(v, dtype=tf.float32), trainable=False) for v in self.trainable_variables]

    def train_step(self, data):
        self.n_acum_step.assign_add(1)

        x, y = data
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign_add(gradients[i])
 
        # When n_acum_step reaches n_gradients, apply the accumulated gradients to update the variables; otherwise do nothing
        tf.cond(tf.equal(self.n_acum_step, self.n_gradients), self.apply_accu_gradients, lambda: None)

        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def apply_accu_gradients(self):
        # apply accumulated gradients
        self.optimizer.apply_gradients(zip(self.gradient_accumulation, self.trainable_variables))

        # reset
        self.n_acum_step.assign(0)
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign(tf.zeros_like(self.trainable_variables[i], dtype=tf.float32))

# Model 
input = tf.keras.Input(shape=(28, 28))
base_maps = tf.keras.layers.Flatten(input_shape=(28, 28))(input)
base_maps = tf.keras.layers.Dense(128, activation='relu')(base_maps)
base_maps = tf.keras.layers.Dense(units=10, activation='softmax', name='primary')(base_maps) 
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])

# bind all
custom_model.compile(
    loss = tf.keras.losses.CategoricalCrossentropy(),
    metrics = ['accuracy'],
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3) )

# data 
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.divide(x_train, 255)
y_train = tf.one_hot(y_train , depth=10) 

# customized fit 
custom_model.fit(x_train, y_train, batch_size=6, epochs=3, verbose = 1)

Output:

Epoch 1/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.5053 - accuracy: 0.8584
Epoch 2/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.1389 - accuracy: 0.9600
Epoch 3/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.0898 - accuracy: 0.9748
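
One detail worth noting about the code above: it sums the mini-batch gradients and does not divide by n_gradients before calling apply_gradients. With an adaptive optimizer such as Adam the update is largely insensitive to a constant scale on the gradient, but with plain SGD a summed gradient gives a step n_gradients times larger than an averaged one. The following plain-Python sketch illustrates that arithmetic only; sgd_step and the gradient values are hypothetical and are not part of the answer's code.

```python
def sgd_step(w, grad, lr):
    """One plain SGD update on a scalar weight (hypothetical helper)."""
    return w - lr * grad

# Hypothetical per-mini-batch gradients, for illustration only
micro_grads = [0.8, 1.2, 1.0, 0.6]
n = len(micro_grads)
w0, lr = 2.0, 0.1

summed = sum(micro_grads)   # what a sum-only accumulator holds
averaged = summed / n       # what a large-batch mean loss would give

w_sum = sgd_step(w0, summed, lr)               # step taken with the summed gradient
w_avg = sgd_step(w0, averaged, lr)             # step taken with the averaged gradient
w_sum_rescaled = sgd_step(w0, summed, lr / n)  # summed gradient, learning rate / n

# The summed step is n times the averaged step,
# and dividing the learning rate by n removes the difference.
print((w0 - w_sum) / (w0 - w_avg))
print(abs(w_avg - w_sum_rescaled) < 1e-12)
```

If the averaged behaviour is wanted in train_step, one option (an assumption, not something the answer does) is to divide each accumulated gradient by n_gradients before applying it.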

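Pros:

Gradient accumulation is a mechanism that splits the batch of samples used to train a neural network into several mini-batches that are run sequentially.

GA computes the loss and gradients after each mini-batch, but instead of updating the model parameters immediately, it waits and accumulates the gradients over consecutive mini-batches. This lets it work around memory constraints: the model can be trained with less memory as if it were using a large batch size.

Example: running gradient accumulation for 5 steps with a batch size of 4 images serves almost the same purpose as running with a batch size of 20 images.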
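
The 5 x 4 versus 20 example can be checked numerically. The sketch below is a minimal illustration in plain Python, assuming a one-parameter linear model with a mean-squared-error loss; grad_mse and the synthetic data are hypothetical. It shows that averaging the gradients of 5 mini-batches of 4 samples reproduces the gradient of one batch of 20 samples, up to floating-point rounding. The match is only "almost" exact in real networks because layers such as batch normalization compute their statistics per mini-batch.

```python
import random

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Synthetic data: 20 samples around the line y = 3x (illustrative values)
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(20)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5

# Gradient of one batch of 20 samples
full_grad = grad_mse(w, xs, ys)

# Gradient accumulated over 5 mini-batches of 4 samples, then averaged
n_steps, micro_batch = 5, 4
accum = 0.0
for step in range(n_steps):
    lo = step * micro_batch
    accum += grad_mse(w, xs[lo:lo + micro_batch], ys[lo:lo + micro_batch])
accum_grad = accum / n_steps

print(abs(full_grad - accum_grad) < 1e-9)
```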

GA can also be combined with parallel training, i.e. aggregating gradients from multiple machines.

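Things to consider:

This technique works well and is therefore widely used. There are only a few things to consider before using it, and I don't think they should be called drawbacks; after all, all GA does is turn 4 + 4 into 2 + 2 + 2 + 2.

If your machine has enough memory for a batch size that is already large enough, there is no need to use GA. Very large batch sizes are commonly reported to hurt generalization, and GA will certainly run slower than directly using the same total batch size when your machine's memory can already handle it.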
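
To make the cost concrete, the sketch below counts steps for the MNIST run shown earlier (60,000 training samples, batch_size=6, n_gradients=10) and for a direct run at the same effective batch size. ga_schedule is a hypothetical helper, and the count ignores any leftover mini-batches at the end of an epoch. Both runs perform the same number of optimizer updates, but the GA run needs ten times as many sequential forward/backward passes, which is where the slowdown comes from.

```python
import math

def ga_schedule(num_samples, micro_batch, accum_steps):
    """Per-epoch step counts for gradient accumulation (hypothetical helper)."""
    passes = math.ceil(num_samples / micro_batch)
    return {
        "forward_backward_passes": passes,
        "optimizer_updates": passes // accum_steps,
        "effective_batch": micro_batch * accum_steps,
    }

with_ga = ga_schedule(60000, micro_batch=6, accum_steps=10)
direct = ga_schedule(60000, micro_batch=60, accum_steps=1)

print(with_ga)
print(direct)
```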

Reference:

What is Gradient Accumulation in Deep Learning?

Answer 2

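Thanks to @Mr.For Example for the convenient answer.

In general, I have also observed that gradient accumulation does not speed up training, since we still perform n_gradients forward passes and compute all the gradients. In my experience, however, it did make the model converge faster. I also found the mixed_precision technique really helpful here. Details are here.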

# Non-experimental mixed precision API (TF 2.4 and later)
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)

Here is a complete gist.
