Why not use mean squared error for classification problems?
I am trying to solve a simple binary classification problem using an LSTM, and I am trying to figure out the correct loss function for the network. The issue is that when I use binary cross-entropy as the loss function, the loss values for training and testing are relatively high compared to using the mean squared error (MSE) function.
Upon research, I came across justifications that binary cross-entropy should be used for classification problems and MSE for regression problems. However, in my case, I am getting better accuracy and a lower loss value with MSE for binary classification. I am not sure how to justify these results. Why not use mean squared error for classification problems?
I would like to show it using an example. Assume a 6-class classification problem.

Assume true probabilities = [1, 0, 0, 0, 0, 0]

Case 1: predicted probabilities = [0.2, 0.16, 0.16, 0.16, 0.16, 0.16]

Case 2: predicted probabilities = [0.4, 0.5, 0.1, 0, 0, 0]

The MSE in Case 1 and Case 2 is 0.128 and 0.1033 respectively.

Although Case 1 correctly predicts class 1 for the instance, the loss in Case 1 is higher than the loss in Case 2.
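The two MSE values above can be reproduced with a few lines of plain Python (these are the exact numbers from the example, computed as the mean of per-class squared errors):

```python
# Reproducing the 6-class example: MSE assigns a *lower* loss to Case 2,
# even though its argmax picks the wrong class.
y_true = [1, 0, 0, 0, 0, 0]
case1 = [0.2, 0.16, 0.16, 0.16, 0.16, 0.16]  # argmax -> class 1 (correct)
case2 = [0.4, 0.5, 0.1, 0.0, 0.0, 0.0]       # argmax -> class 2 (wrong)

def mse(y, p):
    # Mean of per-class squared errors.
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

print(round(mse(y_true, case1), 4))  # 0.128
print(round(mse(y_true, case2), 4))  # 0.1033
```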
Though @nerd21 gives a good example of "MSE as a loss function is bad for 6-class classification", it's not the same for binary classification.

Let's consider just binary classification. The label is [1, 0], one prediction is h1 = [p, 1-p], and another prediction is h2 = [q, 1-q], so their (summed) squared errors are:
L1 = 2*(1-p)^2, L2 = 2*(1-q)^2
Assume h1 is a misclassification, i.e. p < 1-p, thus 0 < p < 0.5.

Assume h2 is a correct classification, i.e. q > 1-q, thus 0.5 < q < 1.

Then L1 - L2 = 2*(p-q)*(p+q-2) > 0 is guaranteed: p < q for sure, so p - q < 0; and p + q < 0.5 + 1 = 1.5, thus p + q - 2 < -0.5 < 0. The product of two negative factors is positive, thus L1 - L2 > 0, i.e. L1 > L2.
This means that for binary classification with MSE as the loss function, a misclassification will always have a larger loss than a correct classification.
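The inequality above can be spot-checked numerically under the same assumptions (p drawn below 0.5 for the misclassified prediction, q drawn above 0.5 for the correct one):

```python
# Numerical check: for a misclassified h1 (p < 0.5) and a correctly
# classified h2 (q > 0.5), L1 = 2*(1-p)^2 always exceeds L2 = 2*(1-q)^2.
import random

random.seed(0)
for _ in range(1000):
    p = random.uniform(0.0, 0.5)   # misclassified: p < 0.5
    q = random.uniform(0.5, 1.0)   # correct:       q > 0.5
    L1 = 2 * (1 - p) ** 2
    L2 = 2 * (1 - q) ** 2
    assert L1 > L2
print("L1 > L2 held for all 1000 samples")
```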
The answer is right there in your question: the value of the binary cross-entropy loss is higher than the squared-error loss.
Let's say your model predicted 1e-7 and the actual label is 1. The binary cross-entropy loss will be -log(1e-7) ≈ 16.12. The squared error will be (1 - 1e-7)^2 ≈ 1.0.
Let's say your model predicted 0.94 and the actual label is 1. The binary cross-entropy loss will be -log(0.94) ≈ 0.06. The squared error will be (1 - 0.94)^2 = 0.0036.
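Both cases can be computed explicitly to confirm the hand calculations:

```python
# The two cases above, computed explicitly.
import math

# Confidently wrong: predicted 1e-7, true label 1.
bce_far = -math.log(1e-7)       # cross-entropy, ~16.12
sqe_far = (1 - 1e-7) ** 2       # squared error, ~1.0

# Mildly off: predicted 0.94, true label 1.
bce_near = -math.log(0.94)      # cross-entropy, ~0.062
sqe_near = (1 - 0.94) ** 2      # squared error, 0.0036

print(round(bce_far, 2), round(sqe_far, 2))    # 16.12 1.0
print(round(bce_near, 4), round(sqe_near, 4))  # 0.0619 0.0036
```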
In the first case, when the prediction is far off from reality, the BCE loss has a much larger value than the squared error. When the loss value is large, the gradients are large as well, so the optimizer takes a larger step in the direction opposite to the gradient, which results in a relatively larger reduction in loss.
I'd like to share my understanding of the MSE and binary cross-entropy functions.

In the case of classification, we take the argmax of the probability of each training instance.
Now, consider an example of a binary classifier where the model predicts the probabilities as [0.49, 0.51]. In this case, the model will return 1 as the prediction. Now, assume that the actual label is also 1.
In such a case, if MSE is computed on that hard (argmax) prediction, it will return 0 as the loss value, whereas the binary cross-entropy, computed on the probability itself, will return some "tangible" value. And if the trained model somehow predicts this kind of probability for all data samples, then binary cross-entropy effectively returns a large accumulated loss value, whereas MSE on the hard predictions returns 0.
According to MSE computed this way, it's a perfect model, but actually it's not that good a model, which is why we should not use MSE for classification.
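The argument above can be sketched in a few lines, scoring the argmax prediction with MSE but the underlying probability with cross-entropy:

```python
# Sketch of the argument: MSE on the hard (argmax) prediction sees a
# "perfect" model, while cross-entropy on the probability still penalizes
# the model's low confidence.
import math

probs = [0.49, 0.51]                  # model output for one instance
label = 1                             # true class
pred = probs.index(max(probs))        # argmax -> 1, matches the label

mse_on_argmax = (label - pred) ** 2   # 0: "perfect" by this measure
bce = -math.log(probs[label])         # -log(0.51), a tangible loss

print(mse_on_argmax, round(bce, 3))   # 0 0.673
```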