
Why ReLU function after every layer in CNN?

I am taking Intro to ML on Coursera offered by Duke, which I recommend if you are interested in ML. The instructors of this course explained that "We typically include nonlinearities between layers of a neural network. There's a number of reasons to do so. For one, without anything nonlinear between them, successive linear transforms (fully connected layers) collapse into a single linear transform, which means the model isn't any more expressive than a single layer. On the other hand, intermediate nonlinearities prevent this collapse, allowing neural networks to approximate more complex functions." I am curious: if I apply ReLU, aren't we losing information, since ReLU transforms every negative value to 0? Then how is this transformation more expressive than one without ReLU?
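To make the quoted "collapse" argument concrete, here is a minimal numpy sketch (not from the course) showing that two fully connected layers with nothing nonlinear between them compute exactly the same function as a single linear layer; the shapes and random values are arbitrary:

```python
# Two stacked linear layers with no nonlinearity collapse into one linear layer.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                   # an arbitrary input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)     # first fully connected layer
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)     # second fully connected layer

two_layers = W2 @ (W1 @ x + b1) + b2                     # layer 2 applied after layer 1
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)               # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))                # True: no extra expressiveness
```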

I tried to run a multilayer perceptron on the MNIST dataset without a ReLU transformation, and it seems that the performance didn't change much (92% with ReLU and 90% without ReLU). But still, I am curious why this transformation gives us more information rather than losing information.
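For reference, a minimal sketch of this kind of comparison, assuming PyTorch and torchvision are available; the layer sizes, learning rate, and epoch count are illustrative, not the ones used in the course:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_data = datasets.MNIST("data", train=True, download=True,
                            transform=transforms.ToTensor())
test_data = datasets.MNIST("data", train=False, download=True,
                           transform=transforms.ToTensor())
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
test_loader = DataLoader(test_data, batch_size=256)

def make_mlp(with_relu: bool) -> nn.Sequential:
    layers = [nn.Flatten(), nn.Linear(784, 256)]
    if with_relu:
        layers.append(nn.ReLU())                 # the only difference between the two models
    layers.append(nn.Linear(256, 10))
    return nn.Sequential(*layers)

def test_accuracy(model: nn.Module) -> float:
    correct = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
    return correct / len(test_data)

for with_relu in (True, False):
    model = make_mlp(with_relu)
    opt = optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(3):                       # a few epochs is enough to compare
        for x, y in train_loader:
            opt.zero_grad()
            nn.functional.cross_entropy(model(x), y).backward()
            opt.step()
    print(f"ReLU={with_relu}: test accuracy {test_accuracy(model):.3f}")
```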

The first point is that without nonlinearities, such as the ReLU function, a neural network is limited to performing linear combinations of the input. In other words, the network can only learn linear relationships between the input and output. This means that the network can't approximate complex functions that are not linear, such as polynomials or other non-linear equations.

Consider a simple example where the task is to classify a 2D data point as belonging to one of two classes based on its coordinates (x, y). A linear classifier, such as a single-layer perceptron, can only draw a straight line to separate the two classes. However, if the data points are not linearly separable, a linear classifier will not be able to classify them accurately. A nonlinear classifier, such as a multi-layer perceptron with a nonlinear activation function, can draw a curved decision boundary and separate the two classes more accurately.
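A hedged sketch of that 2D example, using scikit-learn (my choice, not something the answer specifies) with concentric circles as a linearly non-separable dataset: a linear classifier stays near chance, while a small ReLU MLP separates the classes.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Two concentric rings of points: no straight line can separate them.
X, y = make_circles(n_samples=500, noise=0.05, factor=0.4, random_state=0)

linear = LogisticRegression().fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)

print("linear classifier accuracy:", linear.score(X, y))   # close to chance (~0.5)
print("ReLU MLP accuracy:        ", mlp.score(X, y))        # close to 1.0
```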

The ReLU function increases the complexity of the neural network by introducing non-linearity, which allows the network to learn more complex representations of the data. The ReLU function is defined as f(x) = max(0, x), which sets all negative values to zero. By setting all negative values to zero, the ReLU function creates multiple linear regions in the network, which allows the network to represent more complex functions.
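A toy numpy illustration of those linear regions (my example, not the answer's): the function f(x) = relu(x) + relu(x - 1) has slope 0 for x < 0, slope 1 on [0, 1], and slope 2 for x > 1, so two ReLU units already carve the input into three linear pieces.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

xs = np.array([-1.0, 0.5, 2.0])      # one sample point in each linear region
f = relu(xs) + relu(xs - 1.0)
print(f)                             # [0.  0.5 3. ] -- three different slopes
```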

For example, suppose you have a neural network with two layers, where the first layer has a linear activation function and the second layer has a ReLU activation function. The first layer can only perform a linear transformation on the input, while the second layer can perform a non-linear transformation. By having a non-linear function in the second layer, the network can learn more complex representations of the data.

In the case of your experiment, it's normal that the performance did not change much when you removed the ReLU function, because the dataset and the problem you were trying to solve might not be complex enough to require a ReLU function. In other words, a linear model might be sufficient for that problem, but for more complex problems, ReLU can be a critical component for achieving good performance.

It's also important to note that ReLU is not the only function that introduces non-linearity; other non-linear activation functions such as sigmoid and tanh could be used as well. The choice of activation function depends on the problem and dataset you are working with.
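A quick numpy comparison of the three activations mentioned above, on a few arbitrary input values, just to show how differently they treat negative and large inputs:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu = np.maximum(0.0, z)
sigmoid = 1.0 / (1.0 + np.exp(-z))
tanh = np.tanh(z)

print("relu:   ", relu)       # zero for negative inputs, identity otherwise
print("sigmoid:", sigmoid)    # squashes to (0, 1), saturates at the extremes
print("tanh:   ", tanh)       # squashes to (-1, 1), zero-centered
```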

Neural networks are inspired by the structure of the brain. Neurons in the brain transmit information between different areas of the brain by using electrical impulses and chemical signals. Some signals are strong and some are not. Neurons with weak signals are not activated.

Neural networks work in the same fashion. Some input features carry weak signals and some carry strong signals; this depends on the features. If they are weak, the related neurons aren't activated and don't transmit the information forward. We know that some features or inputs aren't crucial contributors to the label. For the same reason, we don't bother with feature engineering in neural networks; the model takes care of it. Thus, activation functions help here and tell the model which neurons should fire and how much information they should transmit.
