
Difference between Keras' BatchNormalization and PyTorch's BatchNorm2d?

I have a sample tiny CNN implemented in both Keras and PyTorch. When I print the summary of both networks, the total number of trainable parameters is the same, but the total number of parameters and the number of parameters for Batch Normalization don't match.

Here is the CNN implementation in Keras:

from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Flatten, Dense

IMG_SIZE = 64
inputs = Input(shape=(64, 64, 1))  # Channel last: (NHWC)

model = Conv2D(filters=32, kernel_size=(3, 3), padding='SAME', activation='relu', input_shape=(IMG_SIZE, IMG_SIZE, 1))(inputs)
model = BatchNormalization(momentum=0.15, axis=-1)(model)
model = Flatten()(model)

dense = Dense(100, activation='relu')(model)
head_root = Dense(10, activation='softmax')(dense)

And the summary printed for the above model is:

Model: "model_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_9 (InputLayer)         (None, 64, 64, 1)         0         
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 64, 64, 32)        320       
_________________________________________________________________
batch_normalization_2 (Batch (None, 64, 64, 32)        128       
_________________________________________________________________
flatten_3 (Flatten)          (None, 131072)            0         
_________________________________________________________________
dense_11 (Dense)             (None, 100)               13107300  
_________________________________________________________________
dense_12 (Dense)             (None, 10)                1010      
=================================================================
Total params: 13,108,758
Trainable params: 13,108,694
Non-trainable params: 64
_________________________________________________________________

Here's the implementation of the same model architecture in PyTorch:

# Image format: channel first (NCHW) in PyTorch
import torch.nn as nn

class CustomModel(nn.Module):
    def __init__(self):
        super(CustomModel, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=(3, 3), padding=1),
            nn.ReLU(True),
            nn.BatchNorm2d(num_features=32),
        )
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(in_features=131072, out_features=100)
        self.fc2 = nn.Linear(in_features=100, out_features=10)

    def forward(self, x):
        output = self.layer1(x)
        output = self.flatten(output)
        output = self.fc1(output)
        output = self.fc2(output)
        return output

And the following is the summary output of the above model:

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 32, 64, 64]             320
              ReLU-2           [-1, 32, 64, 64]               0
       BatchNorm2d-3           [-1, 32, 64, 64]              64
           Flatten-4               [-1, 131072]               0
            Linear-5                  [-1, 100]      13,107,300
            Linear-6                   [-1, 10]           1,010
================================================================
Total params: 13,108,694
Trainable params: 13,108,694
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 4.00
Params size (MB): 50.01
Estimated Total Size (MB): 54.02
----------------------------------------------------------------

As you can see in the above results, Batch Normalization in Keras has more parameters than in PyTorch (2x, to be exact). So what's the difference between the above CNN architectures? If they are equivalent, what am I missing here?

Keras treats as parameters (weights) many things that will be "saved/loaded" in the layer.

While both implementations naturally have the accumulated "mean" and "variance" of the batches, these values are not trainable with backpropagation.

Nevertheless, these values are updated every batch, and Keras treats them as non-trainable weights, while PyTorch simply hides them. The term "non-trainable" here means "not trainable by backpropagation", but doesn't mean the values are frozen.
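You can see this directly in PyTorch with a minimal sketch (assuming a recent torch version): BatchNorm2d exposes the scale and offset as parameters, while the running statistics are registered as buffers, which parameter counters like torchsummary skip.

import torch.nn as nn

bn = nn.BatchNorm2d(num_features=32)

# Trainable parameters: weight (scale) and bias (offset), 32 values each
print(sum(p.numel() for p in bn.parameters()))   # 64

# The running statistics are buffers, not parameters,
# so they don't show up in the parameter count
print([name for name, _ in bn.named_buffers()])
# ['running_mean', 'running_var', 'num_batches_tracked']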

In total, there are 4 groups of "weights" for a BatchNormalization layer, considering the selected axis (default = -1, size = 32 for your layer); see the sketch after this list:

  • scale (32) - trainable
  • offset (32) - trainable
  • accumulated means (32) - non-trainable, but updated every batch
  • accumulated std (32) - non-trainable, but updated every batch
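A minimal sketch to verify this in Keras (assuming TensorFlow 2.x; the layer has to be built before its weights exist):

from tensorflow.keras.layers import BatchNormalization

bn = BatchNormalization(axis=-1)
bn.build((None, 64, 64, 32))  # 32 channels, as in the model above

for w in bn.weights:
    print(w.name, tuple(w.shape), w.trainable)
# gamma           -> (32,), trainable      (scale)
# beta            -> (32,), trainable      (offset)
# moving_mean     -> (32,), non-trainable, updated every batch
# moving_variance -> (32,), non-trainable, updated every batch

That is 4 x 32 = 128 parameters in total, 64 trainable plus 64 non-trainable, which matches the Keras summary above, while torchsummary reports only the 64 trainable ones.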

The advantage of having it like this in Keras is that when you save the layer, you also save the mean and variance values automatically, the same way you save all other weights in the layer. And when you load the layer, these weights are loaded together.
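As a side note, PyTorch does persist these running statistics too; they are simply stored as buffers in the state_dict rather than counted as parameters (again a minimal sketch, assuming a recent torch version):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(32)
print(list(bn.state_dict().keys()))
# ['weight', 'bias', 'running_mean', 'running_var', 'num_batches_tracked']

# Buffers are saved and restored along with the parameters
torch.save(bn.state_dict(), "bn.pt")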
