
Why doesn't a batch-normalized layer sum to 1?

I've been looking deeper into how batch norm works in PyTorch and noticed that, for the code below:

import torch
import torch.nn as nn

torch.manual_seed(0)
# With Learnable Parameters
m = nn.BatchNorm2d(1)
# Without Learnable Parameters
#m = nn.BatchNorm2d(1, affine=False)
input = torch.randn(2, 1, 2, 2)
output = m(input)
#print(input)
print(output)

the output below does not sum to 1:

tensor([[[[-0.1461, -0.0348],
          [ 0.4644, -0.0339]]],


        [[[ 0.6359, -0.0718],
          [-1.1104,  0.2967]]]], grad_fn=<NativeBatchNormBackward>)

It sums to 0 instead, and I'm guessing this is because batch norm makes the mean 0 (unless the scale and shift parameters are added). Isn't batch normalization supposed to produce a distribution per channel across the batch?

I think you have BatchNorm confused with Softmax.

To answer your questions in the comments, normalization does not change the shape of the distribution - it simply centers it at 0 and scales it to unit variance.
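
Here is a quick sanity check (just a sketch reusing the snippet from the question) showing that the output is centered at 0 with roughly unit variance, which is why it sums to about 0 rather than 1:

import torch
import torch.nn as nn

torch.manual_seed(0)
m = nn.BatchNorm2d(1)
input = torch.randn(2, 1, 2, 2)
output = m(input)

# The normalized values are centered at 0 with roughly unit variance,
# so they sum to about 0, not 1.
print(output.sum())
print(output.mean(), output.var(unbiased=False))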

For example, if the data was from a uniform distribution, it remains uniform after normalizing, albeit with different statistics.
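
As a small illustrative sketch (the numbers here are arbitrary choices, not from the original post), you can normalize uniform samples yourself and see that the histogram keeps the same flat shape:

import torch

torch.manual_seed(0)
x = torch.rand(10000) * 5 + 10        # uniform samples on [10, 15]
x_norm = (x - x.mean()) / x.std()     # shift to mean 0, scale to unit variance

# The bin counts (the shape of the histogram) stay the same;
# only the bin locations on the X-axis have moved.
print(torch.histc(x, bins=10))
print(torch.histc(x_norm, bins=10))
print(x_norm.mean(), x_norm.std())    # ~0 and ~1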

For example, take the distribution below:

[image: histogram of the original distribution]


After normalizing, this is what the distribution looks like:

[image: the same histogram after normalization]

Notice that the shape of the overall distribution and the number of samples in each bucket are exactly the same - what has changed is the mean value (i.e., the center) of the distribution. And though it is not visually obvious, you can check the new normalized values (the X-axis of the plot) and see that the variance is approximately 1.

This is precisely what BatchNorm does, with the X-axis being each example in a batch. For other kinds of normalization, the dimension being normalized over changes (e.g., from the batch dimension to the feature dimensions in LayerNorm), but the effect is essentially the same.
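
To make the "which dimension gets normalized" point concrete, here is an illustrative sketch (the tensor shape and layer arguments are arbitrary choices) comparing BatchNorm2d, which computes statistics over (N, H, W) per channel, with LayerNorm, which computes them over the feature dimensions per example:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)                             # (N, C, H, W)

bn = nn.BatchNorm2d(3, affine=False)                    # normalizes over (N, H, W) per channel
ln = nn.LayerNorm([3, 8, 8], elementwise_affine=False)  # normalizes over (C, H, W) per example

# Each channel of the BatchNorm output has mean ~0 across the batch...
print(bn(x).mean(dim=(0, 2, 3)))
# ...while each example of the LayerNorm output has mean ~0 across its features.
print(ln(x).mean(dim=(1, 2, 3)))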

If you wanted probabilities, you could simply divide the count in each bin by the total number of samples (scaling the Y-axis instead of the X-axis). This would give a graph of exactly the same shape, with the X-axis values unchanged from the original graph and the Y-axis values scaled to represent probabilities!
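
In code, that Y-axis rescaling is just a division of the bin counts by the sample count (a minimal sketch with made-up data):

import torch

torch.manual_seed(0)
x = torch.randn(1000)

counts = torch.histc(x, bins=20)      # the Y-axis of the histogram
probs = counts / x.numel()            # divide each bin by the number of samples

print(probs.sum())                    # 1.0 -- the bins now form a probability distribution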


Let's now see what Softmax does to the distribution. Applying softmax over the distribution gives the following graph:

[image: the distribution after applying softmax]

As you can see, softmax actually creates a probability distribution over the points: it gives the probability of each point under the assumption that they are all sampled from a Gaussian distribution (the Gaussian part matters theoretically, since that is where the e in the softmax expression comes from).
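
For contrast, here is a minimal sketch (with arbitrary sample points) showing that softmax output always sums to 1, unlike the BatchNorm output from the question:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
points = torch.randn(8)

probs = F.softmax(points, dim=0)      # exp(x_i) / sum_j exp(x_j)

print(probs)
print(probs.sum())                    # 1.0 -- softmax sums to 1; BatchNorm output sums to ~0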

In contrast, simply scaling the Y-axis by the number of samples does not make the Gaussian assumption - it simply builds an empirical distribution from the given points. Since the probability of any point outside that distribution would be 0, it is of little use for generalization. Hence, softmax is used instead of simply turning the sample counts into probabilities.
