
Ordering of batch normalization and dropout?

The original question was about TensorFlow implementations specifically. However, the answers apply to implementations in general. This general answer is also the correct answer for TensorFlow.

When using batch normalization and dropout in TensorFlow (specifically using the contrib.layers), do I need to be worried about the ordering?

It seems possible that if I use dropout followed immediately by batch normalization there might be trouble. For example, if the shift in the batch normalization is trained on the larger-scale numbers of the training outputs, but then that same shift is applied to the smaller-scale numbers (smaller because of the compensation for having more outputs) without dropout during testing, then that shift may be off. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I'm missing?

Also, are there other pitfalls to look out for when using these two together? For example, assuming I'm using them in the correct order with regard to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don't immediately see a problem with that, but I might be missing something.

Thank you very much!

UPDATE:

An experimental test seems to suggest that ordering does matter. I ran the same network twice with only the batch norm and dropout reversed. When the dropout is before the batch norm, validation loss seems to be going up as training loss goes down. They're both going down in the other case. But in my case the movements are slow, so things may change after more training, and it's just a single test. A more definitive and informed answer would still be appreciated.

In Ioffe and Szegedy 2015, the authors state that "we would like to ensure that for any parameter values, the network always produces activations with the desired distribution". So the Batch Normalization layer is actually inserted right after a Conv layer/fully connected layer, but before feeding into the ReLU (or any other kind of) activation. See this video at around the 53-minute mark for more details.

As far as dropout goes, I believe dropout is applied after the activation layer. In the dropout paper, figure 3b, the dropout factor/probability matrix r(l) for hidden layer l is applied on y(l), where y(l) is the result after applying the activation function f.
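For reference, the feed-forward step with dropout in that paper can be written as (notation as in the paper, with ⊙ the elementwise product):

r^{(l)} \sim \mathrm{Bernoulli}(p)
\tilde{y}^{(l)} = r^{(l)} \odot y^{(l)}
z^{(l+1)} = W^{(l+1)} \tilde{y}^{(l)} + b^{(l+1)}
y^{(l+1)} = f(z^{(l+1)})

so the mask r^{(l)} is applied to y^{(l)}, i.e. after the activation f.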

So in summary, the order of using batch normalization and dropout is:

-> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->
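As a rough illustration, a minimal Keras sketch of this ordering (layer sizes, dropout rate, and input shape are placeholder assumptions, not part of the answer):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(256, use_bias=False, input_shape=(784,)),  # FC layer (bias is redundant before BatchNorm)
    layers.BatchNormalization(),                             # normalize the pre-activations
    layers.Activation("relu"),                               # ReLU (or other activation)
    layers.Dropout(0.5),                                     # dropout after the activation
    layers.Dense(10, activation="softmax"),                  # next FC layer
])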

As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments, and it is the best resource on this topic I have found on the internet.

My 2 cents:

Dropout is meant to block information from certain neurons completely, to make sure the neurons do not co-adapt. So the batch normalization has to come after dropout; otherwise you are passing information through the normalization statistics.

If you think about it, in typical ML problems this is the reason we don't compute the mean and standard deviation over the entire data and then split it into train, test, and validation sets. We split first, then compute the statistics over the train set and use them to normalize and center the validation and test datasets.
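A minimal sketch of that split-then-normalize analogy, using hypothetical arrays that are not part of the answer:

import numpy as np

# hypothetical data: split first, then compute statistics on the train set only
data = np.random.randn(1000, 20)
train, test = data[:800], data[800:]

mean, std = train.mean(axis=0), train.std(axis=0)  # statistics from the train set only
train_norm = (train - mean) / std
test_norm = (test - mean) / std                    # reuse the train statistics; no leakage from test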

So I suggest Scheme 1 (this takes pseudomarvin's comment on the accepted answer into consideration):

-> CONV/FC -> ReLU (or other activation) -> Dropout -> BatchNorm -> CONV/FC

as opposed to Scheme 2

-> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC -> (in the accepted answer)

Please note that this means the network under Scheme 2 should show over-fitting compared to the network under Scheme 1, but the OP ran some tests (mentioned in the question) and they support Scheme 2.
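For comparison, a minimal Keras sketch of Scheme 1 under the same placeholder assumptions as the earlier sketch; it differs from the Scheme 2 ordering only in where Dropout and BatchNormalization sit:

from tensorflow.keras import layers, models

# Scheme 1: activation -> Dropout -> BatchNorm
model = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(784,)),
    layers.Dropout(0.5),
    layers.BatchNormalization(),
    layers.Dense(10, activation="softmax"),
])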

Usually, just drop the Dropout (when you have BN):

  • "BN eliminates the need for Dropout in some cases cause BN provides similar regularization benefits as Dropout intuitively" “BN 在某些情况下消除了对Dropout的需求,因为 BN 直观地提供了与 Dropout 相似的正则化优势”
  • "Architectures like ResNet, DenseNet, etc. not using Dropout “ResNet、DenseNet 等架构不使用Dropout

For more details, refer to the paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift], as already mentioned by @Haramoz in the comments.

Conv - Activation - DropOut - BatchNorm - Pool --> Test_loss: 0.04261355847120285
Conv - Activation - DropOut - Pool - BatchNorm --> Test_loss: 0.050065308809280396
Conv - Activation - BatchNorm - Pool - DropOut --> Test_loss: 0.04911309853196144
Conv - Activation - BatchNorm - DropOut - Pool --> Test_loss: 0.06809622049331665
Conv - BatchNorm - Activation - DropOut - Pool --> Test_loss: 0.038886815309524536
Conv - BatchNorm - Activation - Pool - DropOut --> Test_loss: 0.04126095026731491
Conv - BatchNorm - DropOut - Activation - Pool --> Test_loss: 0.05142546817660332
Conv - DropOut - Activation - BatchNorm - Pool --> Test_loss: 0.04827788099646568
Conv - DropOut - Activation - Pool - BatchNorm --> Test_loss: 0.04722036048769951
Conv - DropOut - BatchNorm - Activation - Pool --> Test_loss: 0.03238215297460556


Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), followed each time with

model.add(layers.Flatten())                        # flatten the conv feature maps
model.add(layers.Dense(512, activation="elu"))     # fully connected head
model.add(layers.Dense(10, activation="softmax"))  # 10-class output for MNIST

The convolutional layers have a kernel size of (3,3) and default padding, and the activation is elu. The pooling is a MaxPooling with pool size (2,2). The loss is categorical_crossentropy and the optimizer is adam.

The corresponding dropout probability is 0.2 or 0.3, respectively. The number of feature maps is 32 or 64, respectively.
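Since the training script itself isn't shown, here is a hedged reconstruction of one tested configuration (the Conv - BatchNorm - Activation - DropOut - Pool ordering), assembled from the description above; the compile/fit details are assumptions, not the author's exact code:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
# first convolutional module: Conv - BatchNorm - Activation - DropOut - Pool
model.add(layers.Conv2D(32, (3, 3), input_shape=(28, 28, 1)))  # 32 feature maps, default padding
model.add(layers.BatchNormalization())
model.add(layers.Activation("elu"))
model.add(layers.Dropout(0.2))
model.add(layers.MaxPooling2D((2, 2)))
# second convolutional module, same ordering with 64 feature maps
model.add(layers.Conv2D(64, (3, 3)))
model.add(layers.BatchNormalization())
model.add(layers.Activation("elu"))
model.add(layers.Dropout(0.3))
model.add(layers.MaxPooling2D((2, 2)))
# head quoted above
model.add(layers.Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=20, validation_data=(x_test, y_test))  # MNIST, 20 epochs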

Edit: When I dropped the Dropout, as recommended in some answers, it converged faster but had worse generalization ability than when I used BatchNorm and Dropout.

I found a paper that explains the disharmony between Dropout and Batch Norm (BN). The key idea is what they call the "variance shift". This is due to the fact that dropout behaves differently between the training and testing phases, which shifts the input statistics that BN learns. The main idea can be found in the figure taken from that paper (not reproduced here).

A small demo of this effect can be found in this notebook.

I read the papers recommended in the answer and comments at https://stackoverflow.com/a/40295999/8625228

From Ioffe and Szegedy (2015)'s point of view, only BN should be used in the network structure. Li et al. (2018) give statistical and experimental analyses showing that there is a variance shift when practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.

From Ioffe and Szegedy (2015)'s point of view, BN is located inside/before the activation function. However, Chen et al. (2019) use an IC layer that combines dropout and BN, and they recommend using BN after ReLU.
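A hedged sketch of the IC-layer idea as described here (batch normalization followed by dropout, placed after the activation and before the next weight layer); this is only an illustration under those assumptions, not code from Chen et al. (2019):

from tensorflow.keras import layers, models

def ic_layer(rate=0.2):
    # IC-style block as described: BatchNorm followed by Dropout
    return [layers.BatchNormalization(), layers.Dropout(rate)]

# ... -> Dense -> ReLU -> IC (BN + Dropout) -> Dense -> ...
model = models.Sequential(
    [layers.Dense(256, activation="relu", input_shape=(784,))]
    + ic_layer(0.2)
    + [layers.Dense(10, activation="softmax")]
)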

To be on the safe side, I use only Dropout or only BN in a network.

Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. "Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks." CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.

Ioffe, Sergey, and Christian Szegedy. 2015. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.

Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. "Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift." CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.

According to the research paper, for better performance we should apply BN before applying Dropout.

The correct order is: Conv > Normalization > Activation > Dropout > Pooling

Conv/FC - BN - Sigmoid/tanh - dropout. If the activation function is ReLU or otherwise, the order of normalization and dropout depends on your task.

After reading multiple answers and conducting some tests, here are my hypotheses:

a) Always BN -> AC (nothing between them).
b) BN -> Dropout over Dropout -> BN, but try both. [Newer research finds the first better.]
c) BN eliminates the need for Dropout; no need to use Dropout.
d) Pool at the end.
e) BN before Dropout is data leakage.
f) The best thing is to try every combination.


SO-CALLED BEST METHOD:

Layer -> BN -> AC -> Dropout -> Pool ->
