
Clarification on NN residual layer back-prop derivation

I've looked everywhere and can't find anything that explains the actual derivation of backprop for residual layers. Here's my best attempt and where I'm stuck. It is worth mentioning that the derivation I'm hoping for is from a generic perspective that need not be limited to convolutional NNs.

If the formula for calculating the output of a normal hidden layer is F(x), then the formula for a hidden layer with a residual connection is F(x) + o, where x is the weight-adjusted output of a previous layer, o is the output of a previous layer, and F is the activation function. To get the delta for a normal layer during back-propagation one needs to calculate the gradient of the output, ∂F(x)/∂x. For a residual layer this is ∂(F(x) + o)/∂x, which is separable into ∂F(x)/∂x + ∂o/∂x (1).
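To make the quantities above concrete, here is a minimal finite-difference sketch (scalars and tanh standing in for F; the toy values are my own, not taken from any framework):

    import numpy as np

    F = np.tanh                      # stand-in for the activation F
    o, w = 0.7, 1.3                  # toy values: previous layer's output and its weight
    x = o * w                        # weight-adjusted input to the residual layer
    eps = 1e-6

    y = lambda x_: F(x_) + o         # residual layer output as a function of x, o left untouched

    numeric = (y(x + eps) - y(x - eps)) / (2 * eps)   # numerical ∂(F(x) + o)/∂x
    analytic = 1 - np.tanh(x) ** 2                    # analytic ∂F(x)/∂x for tanh
    print(numeric, analytic)         # they agree: varying x alone only moves the F(x) term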

If all of this is correct, how does one deal with ∂o/∂x? It seems to me that it depends on how far back in the network o comes from.

  • If o is just from the previous layer then o*w = x, where w are the weights connecting the previous layer to the layer for F(x). Taking the derivative of each side relative to o gives ∂(o*w)/∂o = ∂x/∂o, and the result is w = ∂x/∂o, which is just the inverse of the term that comes out at (1) above. Does it make sense that in this case the gradient of the residual layer is just ∂F(x)/∂x + 1/w? Is it accurate to interpret 1/w as a matrix inverse? If so, is that actually getting computed by NN frameworks that use residual connections, or is there some shortcut for adding in the error from the residual? (See the sketch after this list.)

  • If o is from further back in the network then, I think, the derivation becomes slightly more complicated. Here is an example where the residual comes from one layer further back in the network. The network architecture is Input--w1--L1--w2--L2--w3--L3--Out, with a residual connection from the L1 to the L3 layer. The symbol o from the first example is replaced by the layer output L1 to avoid ambiguity. We are trying to calculate the gradient at L3 during back-prop, which has a forward function of F(x)+L1 where x = F(L1*w2)*w3. The derivative of this relationship is ∂x/∂L1 = ∂(F(L1*w2)*w3)/∂L1, which is more complicated but doesn't seem too difficult to compute numerically.
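Here is the sketch referred to above: a plain-numpy walk-through of the Input--w1--L1--w2--L2--w3--L3 example (tanh standing in for F, small random shapes; every name here is mine). It computes the gradient at L1 by hand, treating the skip simply as a second path whose contribution is the upstream error itself, and checks the result against finite differences; no matrix inverse appears anywhere:

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.tanh                         # stand-in for the activation F
    df = lambda z: 1 - np.tanh(z) ** 2  # its derivative

    x_in = rng.normal(size=4)
    w1 = rng.normal(size=(4, 5))
    w2 = rng.normal(size=(5, 5))
    w3 = rng.normal(size=(5, 5))

    def forward(x_in):
        L1 = f(x_in @ w1)
        L2 = f(L1 @ w2)
        L3 = f(L2 @ w3) + L1            # residual merge: L1 added as-is
        return L1, L2, L3

    def backward_to_L1(g_out):
        """Gradient of (g_out . L3) with respect to L1, done by hand."""
        L1, L2, _ = forward(x_in)
        g_main = (g_out * df(L2 @ w3)) @ w3.T   # back through the L3 activation and w3
        g_main = (g_main * df(L1 @ w2)) @ w2.T  # back through the L2 activation and w2
        g_skip = g_out                          # skip path: the upstream error, unchanged
        return g_main + g_skip                  # contributions at a fan-out are summed

    g_out = rng.normal(size=5)
    manual = backward_to_L1(g_out)

    # Finite-difference check: perturb L1 directly and recompute L3 from it
    def L3_from_L1(L1):
        return f(f(L1 @ w2) @ w3) + L1

    L1, _, _ = forward(x_in)
    eps = 1e-6
    numeric = np.zeros_like(L1)
    for i in range(L1.size):
        d = np.zeros_like(L1); d[i] = eps
        numeric[i] = g_out @ (L3_from_L1(L1 + d) - L3_from_L1(L1 - d)) / (2 * eps)

    print(np.max(np.abs(manual - numeric)))   # tiny: the two gradients match

The point of the sketch is that nothing needs to invert w: the upstream error reaching L3 is routed back to L1 unchanged along the skip and simply added to whatever arrives through the main path.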

If the above derivation is reasonable then it's worth noting that there is a case where the derivation fails, and that is when a residual connection originates from the input layer. This is because the input cannot be broken down into an o*w = x expression (where x would be the input values). I think this must suggest that residual connections cannot originate from the input layer, but since I've seen network architecture diagrams that have residual connections originating from the input, this casts my above derivations into doubt. I can't see where I've gone wrong, though. If anyone can provide a derivation or code sample showing how to calculate the gradient at residual merge points correctly, I would be deeply grateful.

EDIT:

The core of my question is: when using residual layers and doing vanilla back-propagation, is there any special treatment of the error at the layers where residuals are added? Since there is a 'connection' between the layer where the residual comes from and the layer where it is added, does the error need to get distributed backwards over this 'connection'? My thinking is that since residual layers provide raw information from the beginning of the network to deeper layers, the deeper layers should provide raw error to the earlier layers.

Based on what I've seen (reading the first few pages of googleable forums, reading the essential papers, and watching video lectures) and Maxim's post down below, I'm starting to think that the answer is that ∂o/∂x = 0 and that we treat o as a constant.
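That reading is easy to check with an autodiff framework. A small PyTorch sketch (the toy shapes and tanh standing in for F are my own choices): the gradient that reaches o is the ordinary chain-rule term through x plus the upstream gradient copied back through the skip, with no special handling at the merge point:

    import torch

    torch.manual_seed(0)
    o = torch.randn(5, requires_grad=True)   # output of an earlier layer
    w = torch.randn(5, 5)                    # weights into the residual layer

    x = o @ w                  # weight-adjusted input, x = o * w
    y = torch.tanh(x) + o      # residual layer: F(x) + o
    y.sum().backward()

    # What autograd stored in o.grad: the chain-rule term through x,
    # plus the upstream gradient (here all ones) passed through the skip as-is.
    with torch.no_grad():
        through_F = (1 - torch.tanh(o @ w) ** 2) @ w.T
        through_skip = torch.ones_like(o)
        print(torch.allclose(o.grad, through_F + through_skip))   # True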

Does anyone do anything special during back-prop through a NN with residual layers? If not, does that mean residual layers are an 'active' part of the network on only the forward pass?

I think you've over-complicated residual networks a little bit. Here's the link to the original paper by Kaiming He et al.

In section 3.2, they describe the "identity" shortcuts as y = F(x, W) + x, where W are the trainable parameters. You can see why it's called "identity": the value from the previous layer is added as is, without any complex transformation. This does two things:

  • F now learns the residual y - x (discussed in 3.1); in short, it's easier to learn.
  • The network gets an extra connection to the previous layer, which improves gradient flow.

The backward flow through the identity mapping is trivial: the error signal is passed unchanged and no inverse matrices are involved (in fact, they are not involved in any linear layer).
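In code, the merge point's backward amounts to no more than this (a schematic sketch, not any particular framework's implementation):

    # Residual block with an identity shortcut: y = F(x, W) + x
    def block_forward(x, forward_F):
        return forward_F(x) + x          # the previous layer's value is added as-is

    def block_backward(grad_y, backward_F):
        grad_main = backward_F(grad_y)   # ordinary backprop through whatever F is
        grad_skip = grad_y               # identity shortcut: the error passes back unchanged
        return grad_main + grad_skip     # gradients from the two paths are summed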

Now, the paper's authors go a bit further and consider a slightly more complicated version of F, which changes the output dimensions (which is probably what you had in mind). They write it generally as y = F(x, W) + Ws * x, where Ws is the projection matrix. Note that, although it's written as a matrix multiplication, this operation is in fact very simple: it adds extra zeros to x to make its shape larger. You can read a discussion of this operation in this question. But this changes the backward pass very little: the error signal is simply clipped back to the original shape of x.
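For the padded shortcut the backward pass is just as plain. A toy sketch of the zero-padding variant described above (the sizes and function names are mine): the forward copies x and pads it with zeros, and the backward clips the incoming error back to x's original shape:

    import numpy as np

    def shortcut_forward(x, out_dim):
        y = np.zeros(out_dim)
        y[: x.size] = x           # "Ws * x" realised as copy-plus-zeros
        return y

    def shortcut_backward(grad_out, in_dim):
        return grad_out[:in_dim]  # the error clipped back to the shape of x

    x = np.arange(4.0)
    g = np.ones(6)
    print(shortcut_forward(x, 6))     # [0. 1. 2. 3. 0. 0.]
    print(shortcut_backward(g, 4))    # [1. 1. 1. 1.]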
