
What's the use of residual connections in neural networks?

Originally posted 2022-06-14 16:39:03. Tags: machine-learning, neural-network, transformer, self-attention

I've recently been learning about self-attention transformers and the "Attention is All You Need" paper. When describing the architecture of the neural network used in the paper, one breakdown of the paper included this explanation for residual connections:

"Residual layer connections are used (of course) in both encoder and decoder blocks" (source: https://www.kaggle.com/code/residentmario/transformer-architecture-self-attention/notebook)

This was, unfortunately, not obvious to me. What is the purpose of residual connections, and why should this be standard practice?

1 Answer

There is nothing "obvious" about skip connections; it is something that we, as a community, learned the hard way. The basic premise is that, with the usual parametrisation of feed-forward layers, it is surprisingly hard for a neural network to learn the identity function. Skip connections make this special function (f(x) = x) extremely easy to learn, which improves training stability and overall performance in a wide range of applications, at pretty much no extra computational cost. You are essentially giving the network an easy way to bypass the convoluted, complex part of the computation when it does not need it, which lets us use big, complex architectures without an in-depth understanding of the dynamics of the problem (which are beyond our current understanding of the math!).
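To make the identity-function point concrete, here is a minimal sketch in plain Python (no framework). The names `feed_forward` and `residual_block`, the two-layer ReLU block, and the zero-weight setup are assumptions chosen for illustration; they are not taken from the paper or from any library.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, vi) for vi in v]

def feed_forward(x, W1, W2):
    """A plain two-layer block: F(x) = W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def residual_block(x, W1, W2):
    """The same block wrapped in a skip connection: y = x + F(x)."""
    return [xi + fi for xi, fi in zip(x, feed_forward(x, W1, W2))]

dim = 4
x = [1.0, -2.0, 0.5, 3.0]
zeros = [[0.0] * dim for _ in range(dim)]  # illustrative all-zero weights

# With all weights at zero, the plain block discards its input entirely,
# while the residual block passes the input through unchanged.
plain_out = feed_forward(x, zeros, zeros)
residual_out = residual_block(x, zeros, zeros)
```

The plain block has to find specific non-trivial weights to reproduce its input (and the ReLU makes that awkward for negative values), whereas the residual block gets the identity for free by driving F(x) towards zero.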

You can look at older papers, such as the one on highway networks, which show how skip connections make it possible to train very deep models that would otherwise be too ill-conditioned to train.
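As a rough illustration of the highway idea, the sketch below mixes a transformed value with the untouched input through a learned gate, y = T * H(x) + (1 - T) * x, which is how I recall the highway formulation. The function name `highway_layer` and the choice to pass the transformed values and gate logits in directly (rather than computing them from weights) are simplifying assumptions.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def highway_layer(x, h, gate_logits):
    """Combine transformed values h = H(x) with the input x.

    Each gate T = sigmoid(logit) decides, per element, how much of the
    transformed value to use and how much of the input to carry over.
    """
    out = []
    for xi, hi, zi in zip(x, h, gate_logits):
        t = sigmoid(zi)
        out.append(t * hi + (1.0 - t) * xi)
    return out

x = [1.0, -2.0]   # layer input
h = [10.0, 10.0]  # hypothetical output of the transform H(x)

carried = highway_layer(x, h, [-50.0, -50.0])     # gate nearly closed: output stays close to x
transformed = highway_layer(x, h, [50.0, 50.0])   # gate nearly open: output stays close to h
```

A plain residual connection can be seen as the ungated cousin of this: the input is always added, and there are no gate parameters to learn.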