
Can a neural network with a non-linear activation function (say, ReLU) be used for a linear classification task?

I think the answer would be yes, but I'm unable to come up with a good explanation for it.

Technically, yes.

The reason you can use a non-linear activation function for this task is that you can post-process the output. Say the activation function's output range is 0.0-1.0; you can then round up or down to get a binary 0/1, a linear yes/no.
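
For concreteness, here is a minimal sketch in Python/NumPy (with made-up weights, not a trained model) of a single sigmoid unit used as a linear classifier: the non-linear output is thresholded at 0.5, which is the same as checking the sign of the underlying linear score.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.array([1.0, -2.0])   # illustrative weights (assumed, not trained)
    b = 0.5                     # illustrative bias

    x = np.array([3.0, 1.0])
    score = w @ x + b           # the linear part
    prob = sigmoid(score)       # non-linear activation, output in (0, 1)
    label = int(prob >= 0.5)    # round to a binary 0/1 decision
    print(prob, label)

Since sigmoid(z) >= 0.5 exactly when z >= 0, rounding the activation recovers the linear decision boundary w·x + b = 0.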

The reason you shouldn't is the same reason you shouldn't attach an industrial heater to a fan and call it a hair-drier: it's unnecessarily powerful, and it can waste resources and time.

I hope this answer helped, have a good day!

The mathematical argument lies in the power to represent linearity; we can use the following three lemmas to show it:

Lemma 1

With affine transformations (a linear layer) we can map the input hypercube [0,1]^d into an arbitrarily small box [a,b]^k. The proof is quite simple: just set all the biases to a and scale the weights by (b-a).
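
A minimal NumPy sketch of this construction (one illustrative choice of weights, assumed here rather than taken from the answer): every weight is set to (b-a)/d and every bias to a, so any point of [0,1]^d lands inside [a,b]^k.

    import numpy as np

    d, k = 4, 2
    a, b = -0.01, 0.01                 # an arbitrarily small target box

    W = np.full((k, d), (b - a) / d)   # all weights scaled by (b - a)
    bias = np.full(k, a)               # all biases equal to a

    x = np.random.rand(d)              # any point in [0,1]^d
    y = W @ x + bias                   # lands in [a,b]^k
    assert np.all((y >= a) & (y <= b))
    print(y)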

Lemma 2

At a sufficiently small scale, many non-linearities are approximately linear. This is essentially the definition of a derivative, or of a first-order Taylor expansion. In particular, take relu(x): for x > 0 it is in fact exactly linear. What about the sigmoid? If we look at a tiny region [-eps, eps], it approaches a linear function as eps -> 0.
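
A quick numerical check of this claim (a sketch, not a proof): on [-eps, eps] the sigmoid stays close to its tangent line at 0, sigmoid(x) ≈ 0.5 + x/4, and the deviation shrinks rapidly as eps shrinks, while relu(x) = x holds exactly for x > 0.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for eps in [1e-1, 1e-2, 1e-3]:
        x = np.linspace(-eps, eps, 101)
        max_err = np.max(np.abs(sigmoid(x) - (0.5 + x / 4)))
        print(f"eps={eps:g}  max deviation from the tangent line: {max_err:.2e}")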

Lemma 3

The composition of affine functions is affine. In other words, a neural network with multiple linear layers is equivalent to one with a single linear layer. This follows from the matrix composition rules:

W2(W1x + b1) + b2 = W2W1x + W2b1 + b2 = (W2W1)x + (W2b1 + b2)
                                        ------    -----------
                                    New weights   New bias
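
A quick NumPy sanity check of this identity with random matrices (illustration only, not part of the proof):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=3)
    W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
    x = rng.normal(size=5)

    two_layers = W2 @ (W1 @ x + b1) + b2            # stacked affine layers
    one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)      # single equivalent affine layer
    assert np.allclose(two_layers, one_layer)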

Combining the above

Composing the three lemmas above, we see that with a non-linear layer there always exists an arbitrarily good approximation of a linear function: we simply use the first layer to map the entire input space into the tiny part of the pre-activation space where the non-linearity is approximately linear, and then "map it back" in the following layer.
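
For ReLU on a bounded input domain the construction is even exact, since ReLU is the identity on the positive half-line. Here is a sketch with a hypothetical linear target f(x) = w·x + b and inputs in [0,1]^3: the first layer adds a large constant c to push the pre-activation into the region where ReLU is linear, and the second layer subtracts it again.

    import numpy as np

    w = np.array([2.0, -1.0, 0.5])   # assumed linear target weights
    b = 0.3                          # assumed linear target bias
    c = 100.0                        # large enough that w @ x + b + c > 0 on [0,1]^3

    def relu_net(x):
        h = np.maximum(0.0, w @ x + (b + c))   # layer 1 + ReLU (always in the linear region)
        return 1.0 * h - c                     # layer 2 maps it back

    x = np.random.rand(3)
    assert np.isclose(relu_net(x), w @ x + b)
    print(relu_net(x), w @ x + b)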

General case

This is a very simple proof. More generally, you can use the Universal Approximation Theorem to show that a sufficiently large non-linear neural network (sigmoid, ReLU, and many others) can approximate any smooth target function, which includes linear ones. That proof (originally given by Cybenko) is, however, much more complex, and relies on showing that specific classes of functions are dense in the space of continuous functions.
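
As an empirical illustration of that general statement (not the Cybenko proof itself; scikit-learn's MLPRegressor is used here only for convenience), a small ReLU network trained on samples of a linear target recovers it closely:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(500, 1))
    y = 3 * X[:, 0] - 1                       # a linear target, y = 3x - 1

    net = MLPRegressor(hidden_layer_sizes=(32,), activation='relu',
                       max_iter=5000, random_state=0)
    net.fit(X, y)

    X_test = np.linspace(0, 1, 5).reshape(-1, 1)
    print(net.predict(X_test))                # should be close to 3x - 1
    print(3 * X_test[:, 0] - 1)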
