
Why it's necessary to freeze all inner state of a Batch Normalization layer when fine-tuning

The following content comes from the Keras tutorial:

This behavior has been introduced in TensorFlow 2.0, in order to enable layer.trainable = False to produce the most commonly expected behavior in the convnet fine-tuning use case.
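For concreteness, here is a minimal sketch (assuming TensorFlow 2.x; not from the original post) of what that behavior means: once trainable is set to False, a BatchNormalization layer runs in inference mode even when called with training=True, so its moving statistics stop updating.

```python
import numpy as np
import tensorflow as tf

# Minimal sketch, assuming TensorFlow 2.x: with trainable=False, a
# BatchNormalization layer runs in inference mode even if called with
# training=True, so its moving mean/variance are no longer updated.
bn = tf.keras.layers.BatchNormalization()
x = np.random.randn(32, 4).astype("float32")

bn(x, training=True)                      # builds the layer and updates the moving statistics
moving_mean_before = bn.moving_mean.numpy().copy()

bn.trainable = False                      # freeze: gamma/beta stop training, moving stats stop updating
bn(x, training=True)
print(np.allclose(moving_mean_before, bn.moving_mean.numpy()))  # True: statistics unchanged
```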

Why should we freeze the layer when fine-tuning a convolutional neural network? Is it because of some mechanism in TensorFlow Keras, or because of the batch normalization algorithm itself? I ran an experiment myself and found that if trainable is not set to False, the model tends to catastrophically forget what it has learned before and returns a very large loss in the first few epochs. What is the reason for that?

During training, varying batch statistics act as a regularization mechanism that can improve the ability to generalize. This can help to minimize overfitting when training for a high number of iterations. Indeed, using a very large batch size can harm generalization, as there is less variation in batch statistics, decreasing regularization.
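A toy illustration of that last point (synthetic data, not from the original answer): the spread of per-batch means shrinks as the batch size grows, so larger batches give less noisy batch statistics and therefore less regularization noise from batch normalization.

```python
import numpy as np

# Toy illustration with synthetic data: the standard deviation of per-batch
# means falls as the batch size grows (roughly 1/sqrt(batch_size)), so large
# batches yield much less variation in batch statistics.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)

for batch_size in (8, 64, 1024):
    n_batches = len(data) // batch_size
    batch_means = data[: n_batches * batch_size].reshape(n_batches, batch_size).mean(axis=1)
    print(f"batch_size={batch_size:5d}  std of batch means = {batch_means.std():.3f}")
```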

When fine-tuning on a new dataset, batch statistics are likely to be very different if the fine-tuning examples have different characteristics to the examples in the original training dataset. Therefore, if batch normalization is not frozen, the network will learn new batch normalization parameters (gamma and beta in the batch normalization paper) that are different from what the other network parameters have been optimised for during the original training. Relearning all the other network parameters is often undesirable during fine-tuning, either due to the required training time or the small size of the fine-tuning dataset. Freezing batch normalization avoids this issue.
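A minimal fine-tuning sketch along these lines (assuming TensorFlow 2.x / Keras; the pretrained backbone, dataset, and binary head are placeholders, not from the original post). Calling the frozen base with training=False keeps its BatchNormalization layers in inference mode, so they retain the moving statistics learned on the original dataset even after the base is later unfrozen.

```python
import tensorflow as tf

# Sketch of convnet fine-tuning with frozen batch normalization
# (placeholder backbone, head, and dataset).
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze everything, including BN gamma/beta and moving stats

inputs = tf.keras.Input(shape=(160, 160, 3))
x = base(inputs, training=False)           # BN layers stay in inference mode
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1)(x)      # hypothetical binary classification head
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
# model.fit(train_ds, epochs=5)            # train the new head first (train_ds is a placeholder)

# Optional second stage: unfreeze the base and fine-tune at a low learning rate.
# Because the base was called with training=False, its BN layers keep using the
# original moving statistics rather than the new dataset's batch statistics.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
# model.fit(train_ds, epochs=5)
```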
