如何在 Keras 中为大批量累积梯度

Question

I am working with a very memory demanding CNN model for a task of classification.我正在使用一个非常需要内存的 CNN 模型来完成分类任务。 This poses a big limit on the batch size that I can use during training.这对我在训练期间可以使用的批量大小构成了很大限制。

One solution is to accumulate the gradients during training, meaning that the weights of the model are not updated after every single batch.一种解决方案是在训练期间累积梯度，这意味着模型的权重不会在每批之后更新。 Instead the same weights are used for several batches, while the gradients from each batch are accumulated and than averaged for a single weight-update action.相反，多个批次使用相同的权重，而每个批次的梯度被累积，然后针对单个权重更新操作进行平均。

I'm using a Tensorflow backend Keras and I'm pretty sure that Keras has no off-the-shelf function/method to achieve this.我正在使用 Tensorflow 后端 Keras，我很确定 Keras 没有现成的功能/方法来实现这一点。

How can it be done for a Keras/tensorflow model?如何为 Keras/tensorflow 模型做到这一点？

Answer 1

As was mentioned in the question, there is no off-the-shelf function/method to achieve this with Keras/Tensorflow.正如问题中提到的，Keras/Tensorflow 没有现成的功能/方法可以实现这一点。 However this can be done by writing a custom optimizer for Keras.但是，这可以通过为 Keras 编写自定义优化器来完成。

The main idea is to use a flag to determine whether to update the weights during each batch.主要思想是使用一个标志来确定是否在每个批次期间更新权重。

The following implementation is based on this github post by "alexeydevederkin" and it is an accumulating Adam optimizer:以下实现基于“alexeydevederkin”的这篇 github 帖子，它是一个累积的 Adam 优化器：

import keras.backend as K
from keras.legacy import interfaces
from keras.optimizers import Optimizer


class AdamAccumulate(Optimizer):

    def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999,
                 epsilon=None, decay=0., amsgrad=False, accum_iters=1, **kwargs):
        if accum_iters < 1:
            raise ValueError('accum_iters must be >= 1')
        super(AdamAccumulate, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.lr = K.variable(lr, name='lr')
            self.beta_1 = K.variable(beta_1, name='beta_1')
            self.beta_2 = K.variable(beta_2, name='beta_2')
            self.decay = K.variable(decay, name='decay')
        if epsilon is None:
            epsilon = K.epsilon()
        self.epsilon = epsilon
        self.initial_decay = decay
        self.amsgrad = amsgrad
        self.accum_iters = K.variable(accum_iters, K.dtype(self.iterations))
        self.accum_iters_float = K.cast(self.accum_iters, K.floatx())

    @interfaces.legacy_get_updates_support
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.lr

        completed_updates = K.cast(K.tf.floordiv(self.iterations, self.accum_iters), K.floatx())

        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay * completed_updates))

        t = completed_updates + 1

        lr_t = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) / (1. - K.pow(self.beta_1, t)))

        # self.iterations incremented after processing a batch
        # batch:              1 2 3 4 5 6 7 8 9
        # self.iterations:    0 1 2 3 4 5 6 7 8
        # update_switch = 1:        x       x    (if accum_iters=4)  
        update_switch = K.equal((self.iterations + 1) % self.accum_iters, 0)
        update_switch = K.cast(update_switch, K.floatx())

        ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        gs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]

        if self.amsgrad:
            vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        else:
            vhats = [K.zeros(1) for _ in params]

        self.weights = [self.iterations] + ms + vs + vhats

        for p, g, m, v, vhat, tg in zip(params, grads, ms, vs, vhats, gs):

            sum_grad = tg + g
            avg_grad = sum_grad / self.accum_iters_float

            m_t = (self.beta_1 * m) + (1. - self.beta_1) * avg_grad
            v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(avg_grad)

            if self.amsgrad:
                vhat_t = K.maximum(vhat, v_t)
                p_t = p - lr_t * m_t / (K.sqrt(vhat_t) + self.epsilon)
                self.updates.append(K.update(vhat, (1 - update_switch) * vhat + update_switch * vhat_t))
            else:
                p_t = p - lr_t * m_t / (K.sqrt(v_t) + self.epsilon)

            self.updates.append(K.update(m, (1 - update_switch) * m + update_switch * m_t))
            self.updates.append(K.update(v, (1 - update_switch) * v + update_switch * v_t))
            self.updates.append(K.update(tg, (1 - update_switch) * sum_grad))
            new_p = p_t

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            self.updates.append(K.update(p, (1 - update_switch) * p + update_switch * new_p))
        return self.updates

    def get_config(self):
        config = {'lr': float(K.get_value(self.lr)),
                  'beta_1': float(K.get_value(self.beta_1)),
                  'beta_2': float(K.get_value(self.beta_2)),
                  'decay': float(K.get_value(self.decay)),
                  'epsilon': self.epsilon,
                  'amsgrad': self.amsgrad}
        base_config = super(AdamAccumulate, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

It can be used in the following way:它可以通过以下方式使用：

opt = AdamAccumulate(lr=0.001, decay=1e-5, accum_iters=5)
model.compile( loss='categorical_crossentropy',   # Loss function
                            optimizer=opt,        # Optimization technique
                            metrics=['accuracy']) # Accuracy matrix
model.fit(X_train, y_train, batch_size = 10)

In this example, the model processes 10 samples in every iteration ("batch_size"), but the update to the weights only happens after accumulating 5 such batches ("accum_iters").在此示例中，模型在每次迭代中处理 10 个样本（“batch_size”），但权重的更新仅在累积 5 个这样的批次（“accum_iters”）后发生。 So the actual batch size for updating the weights is 50.所以更新权重的实际批量大小是 50。

Answer 2

We have published an open-source tool to automatically add gradient accumulation support in Keras models we implemented at Run:AI to help us with batch sizing issues.我们发布了一个开源工具，可以在我们在Run:AI 中实现的 Keras 模型中自动添加梯度累积支持，以帮助我们解决批量大小问题。

Using gradient accumulation in our models allowed us to use large batch sizes while being limited by GPU memory.在我们的模型中使用梯度累积使我们能够使用大批量，同时受到 GPU 内存的限制。 It specifically allowed us running neural networks with large batch sizes using only a single GPU.它特别允许我们仅使用单个 GPU 运行大批量的神经网络。

The project is available at https://github.com/run-ai/runai/tree/master/runai/ga along with explanations and examples you can use right out of the box.该项目可在https://github.com/run-ai/runai/tree/master/runai/ga获得，并附有说明和示例，您可以立即使用。

Using this tool, all you have to do is add a single line of code to your Python script, and you can add gradient accumulation support to your optimizer.使用此工具，您只需在 Python 脚本中添加一行代码，即可为优化器添加梯度累积支持。

The Python package is available at PyPI and can be installed using the command: pip install runai . Python 包在PyPI上可用，可以使用以下命令pip install runai ： pip install runai 。

Adding gradient accumulation support to Keras models is extremely easy.为 Keras 模型添加梯度累积支持非常容易。 First, import the package to your code: import runai.ga .首先，将包导入到您的代码中： import runai.ga 。 Then, you have to create a gradient accumulation optimizer.然后，您必须创建一个梯度累积优化器。 There are two ways to do so:有两种方法可以做到：

1. Wrap an existing Keras optimizer 1. 包装现有的 Keras 优化器

You can take any Keras optimizer - whether it's a built-in one (SGD, Adam, etc...) or a custom optimizer with your algorithm implementation - and add gradient accumulation support to it using the next line:您可以使用任何 Keras 优化器 - 无论是内置优化器（SGD、Adam 等...）还是带有算法实现的自定义优化器 - 并使用下一行为其添加梯度累积支持：

optimizer = runai.ga.keras.optimizers.Optimizer(optimizer, steps=STEPS)

Where optimizer is your optimizer, and STEPS is the number of steps you want to accumulate gradients over.其中optimizer是您的优化器，而STEPS是您想要累积梯度的步骤数。

2. Create a gradient accumulation version of any of the built-ins optimizers 2. 创建任何内置优化器的梯度累积版本

There are gradient accumulation versions of all built-in optimizers (SGD, Adam, etc...) available in the package.包中提供了所有内置优化器（SGD、Adam 等）的梯度累积版本。 They can be created using this line:可以使用以下行创建它们：

optimizer = runai.ga.keras.optimizers.Adam(steps=STEPS)

Here, we create a gradient accumulation version of Adam optimizer, and we accumulate gradients over STEPS steps.在这里，我们创建了Adam优化器的梯度累积版本，并在STEPS步骤上累积梯度。

More information, explanations, and examples are available in GitHub . GitHub中提供了更多信息、解释和示例。

In addition to the open-source tool itself, we have published a series of 3 articles on Towards Data Science (Medium), where we explained issues when using large batch sizes, what is gradient accumulation and how can it help in solving these issues, how it works, and how we implemented it.除了开源工具本身，我们还在 Towards Data Science (Medium) 上发表了 3 篇系列文章，其中解释了使用大批量时的问题、什么是梯度累积以及它如何帮助解决这些问题，它是如何工作的，以及我们如何实施它。 Here are links to the articles:以下是文章的链接：

Let us know if the tool helped you in using gradient accumulation in your own Keras models.让我们知道该工具是否帮助您在您自己的 Keras 模型中使用梯度累积。 We are here to give any support and help with the problems you encounter when using it in your own models.我们在这里为您在自己的模型中使用它时遇到的问题提供任何支持和帮助。

Answer 3

A more convenient way is to inject some changes into the existing optimizer.更方便的方法是将一些更改注入现有优化器。

class AccumOptimizer(Optimizer):
    """Inheriting Optimizer class, wrapping the original optimizer
    to achieve a new corresponding optimizer of gradient accumulation.
    # Arguments
        optimizer: an instance of keras optimizer (supporting
                    all keras optimizers currently available);
        steps_per_update: the steps of gradient accumulation
    # Returns
        a new keras optimizer.
    """
    def __init__(self, optimizer, steps_per_update=1, **kwargs):
        super(AccumOptimizer, self).__init__(**kwargs)
        self.optimizer = optimizer
        with K.name_scope(self.__class__.__name__):
            self.steps_per_update = steps_per_update
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.cond = K.equal(self.iterations % self.steps_per_update, 0)
            self.lr = self.optimizer.lr
            self.optimizer.lr = K.switch(self.cond, self.optimizer.lr, 0.)
            for attr in ['momentum', 'rho', 'beta_1', 'beta_2']:
                if hasattr(self.optimizer, attr):
                    value = getattr(self.optimizer, attr)
                    setattr(self, attr, value)
                    setattr(self.optimizer, attr, K.switch(self.cond, value, 1 - 1e-7))
            for attr in self.optimizer.get_config():
                if not hasattr(self, attr):
                    value = getattr(self.optimizer, attr)
                    setattr(self, attr, value)
            # Cover the original get_gradients method with accumulative gradients.
            def get_gradients(loss, params):
                return [ag / self.steps_per_update for ag in self.accum_grads]
            self.optimizer.get_gradients = get_gradients
    def get_updates(self, loss, params):
        self.updates = [
            K.update_add(self.iterations, 1),
            K.update_add(self.optimizer.iterations, K.cast(self.cond, 'int64')),
        ]
        # gradient accumulation
        self.accum_grads = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        grads = self.get_gradients(loss, params)
        for g, ag in zip(grads, self.accum_grads):
            self.updates.append(K.update(ag, K.switch(self.cond, ag * 0, ag + g)))
        # inheriting updates of original optimizer
        self.updates.extend(self.optimizer.get_updates(loss, params)[1:])
        self.weights.extend(self.optimizer.weights)
        return self.updates
    def get_config(self):
        iterations = K.eval(self.iterations)
        K.set_value(self.iterations, 0)
        config = self.optimizer.get_config()
        K.set_value(self.iterations, iterations)
        return config

usage:用法：

opt = AccumOptimizer(Adam(), 10) # 10 is accumulative steps
model.compile(loss='mse', optimizer=opt)
model.fit(x_train, y_train, epochs=10, batch_size=10)

reference: https://github.com/bojone/accum_optimizer_for_keras参考： https : //github.com/bojone/accum_optimizer_for_keras

如何在 Keras 中为大批量累积梯度

问题描述

3 个解决方案

解决方案1
15 已采纳 2019-03-21 13:26:42

解决方案2
8 2020-01-23 09:18:13

解决方案3
4 2019-07-09 06:57:28

如何在 Keras 中为大批量累积梯度

问题描述

3 个解决方案

解决方案1 15 已采纳 2019-03-21 13:26:42

解决方案2 8 2020-01-23 09:18:13

解决方案3 4 2019-07-09 06:57:28

解决方案1
15 已采纳 2019-03-21 13:26:42

解决方案2
8 2020-01-23 09:18:13

解决方案3
4 2019-07-09 06:57:28