
Question About Dropout Layer and Batch Normalization Layer in DNN model

I have some questions about the Dropout layer and the Batch Normalization layer. Basically, I have built a simple DNN structure with a Dropout layer and a Batch Normalization layer, and training it works fine.

The simple structure of the DNN model, for example:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(10, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(8, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(6, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1, activation='softmax'),
])

model.compile(
    optimizer='adam',
    loss='mae',
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=100,
    verbose=0,
)

But now I would like to use the trained model's weights and biases from all layers in my custom prediction model (forget about the other way).

# Predictions for test
test_logits_1 = tf.matmul(tf_test_dataset, weights_1) + biases_1
test_relu_1 = tf.nn.relu(test_logits_1)

test_logits_2 = tf.matmul(test_relu_1, weights_2) + biases_2
test_relu_2 = tf.nn.relu(test_logits_2)

test_logits_3 = tf.matmul(test_relu_2, weights_3) + biases_3
test_relu_3 = tf.nn.relu(test_logits_3)

test_logits_4 = tf.matmul(test_relu_3, weights_4) + biases_4
test_prediction = tf.nn.softmax(test_logits_4)

Now the question is: do I need to add the Dropout layer, the Batch Normalization layer, and the batch size in the prediction model? If yes, why, and how do I extract all the details of those layers and use them in my custom prediction model?

@Dr. Snoopy, thanks for pointing out that BatchNormalization has parameters, but to my knowledge they are not the normalization weights (the weights being normalized), based on what I was able to deduce from the docs and a little research.

The docs say the following (quoted below), and based on the description it is clear that the beta and gamma values are trainable variables, which tallies with the output from TensorFlow.

During training (ie when using fit() or when calling the layer/model with the argument training=True), the layer normalizes its output using the mean and standard deviation of the current batch of inputs. That is to say, for each channel being normalized, the layer returns (batch - mean(batch)) / (var(batch) + epsilon) * gamma + beta, where:

  • epsilon is small constant (configurable as part of the constructor arguments)
  • gamma is a learned scaling factor (initialized as 1), which can be disabled by passing scale=False to the constructor.
  • beta is a learned offset factor (initialized as 0), which can be disabled by passing center=False to the constructor.
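One quick way to confirm this on the trained model from the question (a small sketch; in that Sequential model the first BatchNormalization layer sits at index 2) is to look at the layer's trainable weights:

bn = model.layers[2]  # first BatchNormalization layer in the model above
print(bn.trainable_weights)  # two variables per layer: gamma and beta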

[Screenshot: TensorFlow output]

But that is not the end of the story, as the model summary indicates more parameters than beta and gamma account for.

[Screenshot: model.summary() output]

A factor of 4 can be observed here, i.e. the number of parameters in a BatchNormalization layer is 4 times the input shape the layer operates on.
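A quick way to check that factor on the model from the question (a sketch; the first BatchNormalization layer there follows Dense(10), so it should report 4 * 10 = 40 parameters):

print(model.layers[2].count_params())  # expected: 40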

These additional parameters are the moving_mean and moving_variance values, which can be seen in the following output:

[Screenshot: layer weights, including moving_mean and moving_variance]

Coming back to the original question and concern of the OP, "What parameters should I worry about?", the parameters needed for inference are the moving_mean, moving_variance, beta, and gamma values.
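Assuming the Sequential model defined in the question, one way to pull these values out of the trained model is get_weights() on each BatchNormalization layer (a sketch; with scale=True and center=True the arrays should come back in the order gamma, beta, moving_mean, moving_variance, and the BatchNormalization layers sit at indices 2, 5 and 8 in that model):

# Sketch: extract the BatchNormalization parameters from the trained model
gamma_1, beta_1, moving_mean_1, moving_var_1 = model.layers[2].get_weights()
gamma_2, beta_2, moving_mean_2, moving_var_2 = model.layers[5].get_weights()
gamma_3, beta_3, moving_mean_3, moving_var_3 = model.layers[8].get_weights()

# The Dense layers' kernels and biases come out the same way, e.g.
weights_1, biases_1 = model.layers[0].get_weights()
weights_2, biases_2 = model.layers[3].get_weights()
weights_3, biases_3 = model.layers[6].get_weights()
weights_4, biases_4 = model.layers[9].get_weights()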

The way to use these values/parameters is again easily deducible from the docs, which I quote here again:

During inference (ie when using evaluate() or predict() or when calling the layer/model with the argument training=False (which is the default), the layer normalizes its output using a moving average of the mean and standard deviation of the batches it has seen during training. That is to say, it returns (batch - self.moving_mean) / (self.moving_var + epsilon) * gamma + beta.

self.moving_mean and self.moving_var are non-trainable variables that are updated each time the layer is called in training mode, as such:

  • moving_mean = moving_mean * momentum + mean(batch) * (1 - momentum)
  • moving_var = moving_var * momentum + var(batch) * (1 - momentum)

As such, the layer will only normalize its inputs during inference after having been trained on data that has similar statistics as the inference data.
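As a quick numeric illustration of those two update rules (a toy sketch in NumPy; the momentum of 0.99 is the Keras default, so treat it as an assumption and read it from the layer's config if it was changed):

import numpy as np

momentum = 0.99                    # Keras default; an assumption here
moving_mean = np.zeros(10)         # layer's initial moving mean
moving_var = np.ones(10)           # layer's initial moving variance

batch = np.random.randn(256, 10)   # one batch of pre-normalization activations
moving_mean = moving_mean * momentum + batch.mean(axis=0) * (1 - momentum)
moving_var = moving_var * momentum + batch.var(axis=0) * (1 - momentum)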

So assuming the moving_mean, moving_variance, beta, and gamma values are available for every BatchNormalization layer, I think the following piece of code needs to be added after the first activation:

# epsilon is just to avoid ZeroDivisionError, so the default value should be okay
test_BN_1 = (test_relu_1 - moving_mean_1) / (moving_var_1 + epsilon_1) * gamma_1 + beta_1

EDIT:

It turns out that the documentation seems to be wrong, but the implementation seems to be right, based on what I could deduce from the source code on GitHub.

If you follow the links below, you'll see that in the call method of the BatchNormalization class here https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py#L1227 the calculation is actually done by the Keras backend normalization function batch_normalization here https://github.com/keras-team/keras/blob/35146d00b44ca645fbf4ad0b007faaa07632c6f9e/keras/backend.py#L2963 . The backend function's docstring seems to be in agreement with what is mentioned in the reference paper and the picture you've posted.

So that means you should use the square root of the variance only.
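Putting that together, a corrected version of the snippet suggested earlier would look like this (same assumed variable names as above; 1e-3 is the Keras default epsilon, so check the layer's config if it was changed):

epsilon_1 = 1e-3  # Keras default; an assumption here
test_BN_1 = (test_relu_1 - moving_mean_1) / tf.sqrt(moving_var_1 + epsilon_1) * gamma_1 + beta_1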
