

Does BatchNormalization use moving average across batches or only per batch? And how to use moving average across batches?

As the title says, I'm wondering if each mini-batch normalization happens based only on that mini-batch's own statistics, or does it use moving averages/statistics across mini-batches (during training)?

Also, is there a way to force batch normalization to use moving averages/statistics across batches?

The motivation is that because of memory limitations, my batch size is quite small.

Thanks in advance.

Each mini-batch normalization happens based only on that mini-batch's own statistics.

To use moving averages/statistics across batches: batch renormalization is another interesting approach for applying batch normalization to small batch sizes. The basic idea behind batch renormalization comes from the fact that we do not use the individual mini-batch statistics for batch normalization during inference. Instead, we use a moving average of the mini-batch statistics. This is because a moving average provides a better estimate of the true mean and variance compared to individual mini-batches.
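Below is a minimal sketch (assuming TensorFlow/Keras, since the question mentions the BatchNormalization layer) illustrating that split in behaviour: with training=True the layer normalizes with the current mini-batch's own statistics and only updates its moving averages, while with training=False it normalizes with the accumulated moving averages.

    # Minimal sketch, assuming TensorFlow 2.x / Keras.
    import numpy as np
    import tensorflow as tf

    bn = tf.keras.layers.BatchNormalization(momentum=0.9)
    x = tf.constant(np.random.randn(4, 8), dtype=tf.float32)  # a tiny mini-batch

    y_train = bn(x, training=True)   # uses this batch's own mean/variance, updates moving stats
    y_infer = bn(x, training=False)  # uses the accumulated moving averages instead

    # The moving statistics are updated roughly as:
    #   moving_stat = momentum * moving_stat + (1 - momentum) * batch_stat
    print(bn.moving_mean.numpy()[:3], bn.moving_variance.numpy()[:3])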

Then why don't we use the moving average during training? The answer has to do with the fact that during training, we need to perform backpropagation. In essence, when we use some statistics to normalize the data, we need to backpropagate through those statistics as well. If we use the statistics of activations from previous mini-batches to normalize the data, we need to account for how the previous layer affected those statistics during backpropagation. If we ignore these interactions, we could potentially cause previous layers to keep on increasing the magnitude of their activations even though it has no effect on the loss. This means that if we use a moving average, we would need to store the data from all previous mini-batches during training, which is far too expensive.

In batch renormalization, the authors propose to use a moving average while also taking the effect of previous layers on the statistics into account. Their method is - at its core - a simple reparameterization of normalization with the moving average. If we denote the moving average mean and standard deviation as 'mu' and 'sigma', and the mini-batch mean and standard deviation as mu_B and sigma_B, the batch renormalization equation is:

    x_hat = r * (x - mu_B) / sigma_B + d,  where r = sigma_B / sigma and d = (mu_B - mu) / sigma

In other words, we multiply the batch normalized activations by r and add d, where both r and d are computed from the mini-batch statistics and the moving average statistics. The trick here is to not backpropagate through r and d. Though this means we ignore some of the effects of previous layers on previous mini-batches, since the mini-batch statistics and moving average statistics should be the same on average, the overall effect of this should cancel out on average as well.
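Here is a minimal sketch of that correction (TensorFlow assumed; moving_mean and moving_std are hypothetical variables that you would maintain yourself with the usual momentum update), where tf.stop_gradient implements the "do not backpropagate through r and d" part:

    # Sketch of the batch renormalization correction, not a drop-in layer.
    import tensorflow as tf

    def batch_renorm(x, moving_mean, moving_std, eps=1e-5):
        # Per-mini-batch statistics, computed over the batch dimension.
        mu_b = tf.reduce_mean(x, axis=0)
        sigma_b = tf.math.reduce_std(x, axis=0) + eps

        # r and d relate the mini-batch statistics to the moving averages.
        # stop_gradient means r and d are treated as constants during backprop.
        r = tf.stop_gradient(sigma_b / (moving_std + eps))
        d = tf.stop_gradient((mu_b - moving_mean) / (moving_std + eps))

        # Standard batch-normalized activations, then the renorm correction.
        x_hat = (x - mu_b) / sigma_b
        return x_hat * r + d

(The paper additionally clips r and d to a fixed range; that is omitted here for brevity.)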

Unfortunately, batch renormalization's performance still degrades when the batch size decreases (though not as badly as batch normalization), meaning group normalization still has a slight advantage in the small batch size regime.
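If you want to try these in practice, note that (depending on your TensorFlow/Keras version) batch renormalization is exposed as a flag on the standard Keras layer, and group normalization is available as a layer of its own, for example:

    import tensorflow as tf

    # Batch renormalization via the renorm arguments of the standard layer.
    renorm_bn = tf.keras.layers.BatchNormalization(renorm=True, renorm_momentum=0.9)

    # Group normalization, which does not depend on the batch size at all.
    # Available in newer Keras releases; older setups may need tensorflow_addons instead.
    group_norm = tf.keras.layers.GroupNormalization(groups=8)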

Kindly refer to this link for clarification on this.
