Why not use mean squared error for classification problems?
I am trying to solve a simple binary classification problem using an LSTM, and I am trying to figure out the correct loss function for the network. The issue is that when I use binary cross-entropy as the loss function, the loss values for training and testing are relatively high compared to using the mean squared error (MSE) function.
Upon research, I came across justifications that binary cross-entropy should be used for classification problems and MSE for regression problems. However, in my case, I am getting better accuracy and a lower loss value with MSE for binary classification. I am not sure how to justify these results. Why not use mean squared error for classification problems?
I would like to show it using an example. Assume a 6-class classification problem.

Assume true probabilities = [1, 0, 0, 0, 0, 0]

Case 1: predicted probabilities = [0.2, 0.16, 0.16, 0.16, 0.16, 0.16]

Case 2: predicted probabilities = [0.4, 0.5, 0.1, 0, 0, 0]

The MSE in Case 1 and Case 2 is 0.128 and 0.1033 respectively.

Although Case 1 correctly predicts class 1 for the instance, the loss in Case 1 is higher than the loss in Case 2.
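The two MSE values above can be reproduced with a few lines of plain Python (these are the exact numbers from the example, computed as the mean of per-class squared errors):

```python
# Reproducing the 6-class example: MSE assigns a *lower* loss to Case 2,
# even though its argmax picks the wrong class.
y_true = [1, 0, 0, 0, 0, 0]
case1 = [0.2, 0.16, 0.16, 0.16, 0.16, 0.16]  # argmax -> class 1 (correct)
case2 = [0.4, 0.5, 0.1, 0.0, 0.0, 0.0]       # argmax -> class 2 (wrong)

def mse(y, p):
    # Mean of per-class squared errors.
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

print(round(mse(y_true, case1), 4))  # 0.128
print(round(mse(y_true, case2), 4))  # 0.1033
```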
Though @nerd21 gives a good example of "MSE as a loss function is bad for 6-class classification", it's not the same for binary classification.

Let's consider just binary classification. The label is [1, 0], one prediction is h1 = [p, 1-p], and another prediction is h2 = [q, 1-q], so their (summed) squared errors are:
L1 = 2*(1-p)^2, L2 = 2*(1-q)^2
Assume h1 is a misclassification, i.e. p < 1-p, thus 0 < p < 0.5.

Assume h2 is a correct classification, i.e. q > 1-q, thus 0.5 < q < 1.

Then L1 - L2 = 2*(p-q)*(p+q-2) > 0 is guaranteed: p < q for sure, so p - q < 0; and p + q < 0.5 + 1 = 1.5, thus p + q - 2 < -0.5 < 0. The product of two negative factors is positive, thus L1 - L2 > 0, i.e. L1 > L2.
This means that for binary classification with MSE as the loss function, a misclassification will always have a larger loss than a correct classification.
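The inequality above can be spot-checked numerically under the same assumptions (p drawn below 0.5 for the misclassified prediction, q drawn above 0.5 for the correct one):

```python
# Numerical check: for a misclassified h1 (p < 0.5) and a correctly
# classified h2 (q > 0.5), L1 = 2*(1-p)^2 always exceeds L2 = 2*(1-q)^2.
import random

random.seed(0)
for _ in range(1000):
    p = random.uniform(0.0, 0.5)   # misclassified: p < 0.5
    q = random.uniform(0.5, 1.0)   # correct:       q > 0.5
    L1 = 2 * (1 - p) ** 2
    L2 = 2 * (1 - q) ** 2
    assert L1 > L2
print("L1 > L2 held for all 1000 samples")
```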
The answer is right there in your question: the value of the binary cross-entropy loss is higher than the squared-error loss.
Let's say your model predicted 1e-7 and the actual label is 1. The binary cross-entropy loss will be -log(1e-7) ≈ 16.12. The squared error will be (1 - 1e-7)^2 ≈ 1.0.
Let's say your model predicted 0.94 and the actual label is 1. The binary cross-entropy loss will be -log(0.94) ≈ 0.06. The squared error will be (1 - 0.94)^2 = 0.0036.
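Both cases can be computed explicitly to confirm the hand calculations:

```python
# The two cases above, computed explicitly.
import math

# Confidently wrong: predicted 1e-7, true label 1.
bce_far = -math.log(1e-7)       # cross-entropy, ~16.12
sqe_far = (1 - 1e-7) ** 2       # squared error, ~1.0

# Mildly off: predicted 0.94, true label 1.
bce_near = -math.log(0.94)      # cross-entropy, ~0.062
sqe_near = (1 - 0.94) ** 2      # squared error, 0.0036

print(round(bce_far, 2), round(sqe_far, 2))    # 16.12 1.0
print(round(bce_near, 4), round(sqe_near, 4))  # 0.0619 0.0036
```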
In the first case, when the prediction is far off from reality, the BCE loss has a much larger value than the squared error. When the loss value is large, the gradients are large as well, so the optimizer takes a larger step in the direction opposite to the gradient, which results in a relatively larger reduction in loss.
I'd like to share my understanding of the MSE and binary cross-entropy functions.

In the case of classification, we take the argmax of the probability of each training instance.
Now, consider an example of a binary classifier where the model predicts the probabilities as [0.49, 0.51]. In this case, the model will return 1 as the prediction. Now, assume that the actual label is also 1.
In such a case, if MSE is computed on that hard (argmax) prediction, it will return 0 as the loss value, whereas the binary cross-entropy, computed on the probability itself, will return some "tangible" value. And if the trained model somehow predicts this kind of probability for all data samples, then binary cross-entropy effectively returns a large accumulated loss value, whereas MSE on the hard predictions returns 0.
According to MSE computed this way, it's a perfect model, but actually it's not that good a model, which is why we should not use MSE for classification.
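The argument above can be sketched in a few lines, scoring the argmax prediction with MSE but the underlying probability with cross-entropy:

```python
# Sketch of the argument: MSE on the hard (argmax) prediction sees a
# "perfect" model, while cross-entropy on the probability still penalizes
# the model's low confidence.
import math

probs = [0.49, 0.51]                  # model output for one instance
label = 1                             # true class
pred = probs.index(max(probs))        # argmax -> 1, matches the label

mse_on_argmax = (label - pred) ** 2   # 0: "perfect" by this measure
bce = -math.log(probs[label])         # -log(0.51), a tangible loss

print(mse_on_argmax, round(bce, 3))   # 0 0.673
```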