log_loss in sklearn: Multioutput target data is not supported with label binarization

Question

The following code

from sklearn import metrics
import numpy as np
y_true = np.array([[0.2,0.8,0],[0.9,0.05,0.05]])
y_predict = np.array([[0.5,0.5,0.0],[0.5,0.4,0.1]])
metrics.log_loss(y_true, y_predict)

produces the following error:

   ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-24beeb19448b> in <module>()
----> 1 metrics.log_loss(y_true, y_predict)

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\sklearn\metrics\classification.py in log_loss(y_true, y_pred, eps, normalize, sample_weight, labels)
   1646         lb.fit(labels)
   1647     else:
-> 1648         lb.fit(y_true)
   1649 
   1650     if len(lb.classes_) == 1:

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\sklearn\preprocessing\label.py in fit(self, y)
    276         self.y_type_ = type_of_target(y)
    277         if 'multioutput' in self.y_type_:
--> 278             raise ValueError("Multioutput target data is not supported with "
    279                              "label binarization")
    280         if _num_samples(y) == 0:

ValueError: Multioutput target data is not supported with label binarization

I am curious why. I tried rereading the definition of log loss and could not find anything that would make the computation invalid.

Answer 1

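According to the source code, metrics.log_loss does not support probabilities in y_true. It supports only binary indicators of shape (n_samples, n_classes), e.g. [[0,0,1],[1,0,0]], or class labels of shape (n_samples,), e.g. [2, 0]. In the latter case the class labels are one-hot encoded so that they look like an indicator matrix before the log loss is computed.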
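To illustrate the one-hot encoding step, here is a minimal NumPy sketch. It does not reproduce scikit-learn's internals; the helper name one_hot and the example values are assumptions made purely for illustration.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Hypothetical helper: turn integer class labels into an indicator matrix."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes, dtype=int)[labels]

indicator = one_hot([2, 0], n_classes=3)
print(indicator)
# [[0 0 1]
#  [1 0 0]]
```

The class labels [2, 0] end up as the same indicator matrix [[0,0,1],[1,0,0]] used as the example of the first supported form.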

In this block:

lb = LabelBinarizer()

if labels is not None:
    lb.fit(labels)
else:
    lb.fit(y_true)

you reach lb.fit(y_true), which fails if y_true is not made up entirely of 1s and/or 0s. For example:

>>> import numpy as np
>>> from sklearn import preprocessing

>>> lb = preprocessing.LabelBinarizer()

>>> lb.fit(np.array([[0,1,0],[1,0,0]]))

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

>>> lb.fit(np.array([[0.2,0.8,0],[0.9,0.05,0.05]]))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/imran/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 278, in fit
    raise ValueError("Multioutput target data is not supported with "
ValueError: Multioutput target data is not supported with label binarization

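I would define your own custom log loss function:

def logloss(y_true, y_pred, eps=1e-15):
    # Clip predictions away from 0 and 1 so np.log never receives 0.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Cross-entropy per sample (sum over classes), averaged over samples.
    return -(y_true * np.log(y_pred)).sum(axis=1).mean()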

Here is the output on your data:

>>> logloss(y_true, y_predict)
0.738961717153653
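As a sanity check on the custom function, the sketch below (an illustration, not part of the original answer) evaluates it on the soft targets from the question and on one-hot targets. The array names soft and hard are assumptions. With one-hot rows only the true-class probability contributes, so the result should reduce to the ordinary log loss.

```python
import numpy as np

def logloss(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred)).sum(axis=1).mean()

y_predict = np.array([[0.5, 0.5, 0.0], [0.5, 0.4, 0.1]])

# Soft (probabilistic) targets from the question.
soft = np.array([[0.2, 0.8, 0.0], [0.9, 0.05, 0.05]])
soft_loss = logloss(soft, y_predict)

# One-hot targets for classes [1, 0], the argmax of each soft row.
hard = np.array([[0, 1, 0], [1, 0, 0]])
hard_loss = logloss(hard, y_predict)

# Ordinary log loss computed by hand from the true-class probabilities.
by_hand = -np.mean(np.log([0.5, 0.5]))

print(soft_loss, hard_loss, by_hand)
```

The clipping matters here: the prediction 0.0 in the first row would otherwise produce log(0) = -inf, and 0 * -inf evaluates to nan in NumPy.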

Answer 2

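No, I am not talking about binary classification.

Unless told otherwise, the y_true and y_predict you show above are not treated as classification targets.

First, because they are probabilities they can take any continuous value, so scikit-learn detects them as regression targets.

Second, each element of y_pred or y_true is a list of probabilities. This is detected as multioutput. Hence the "multioutput target" error.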
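The detection described above can be inspected with type_of_target from sklearn.utils.multiclass, the same function that appears in the traceback in the question. The snippet is a small sketch; the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

soft = np.array([[0.2, 0.8, 0.0], [0.9, 0.05, 0.05]])  # probabilities
indicator = np.array([[0, 1, 0], [1, 0, 0]])           # 0/1 indicator matrix
labels = np.array([0, 1, 2])                           # plain class labels

for name, y in [("soft", soft), ("indicator", indicator), ("labels", labels)]:
    print(name, "->", type_of_target(y))
# soft -> continuous-multioutput
# indicator -> multilabel-indicator
# labels -> multiclass
```

Only the first type contains the substring 'multioutput', which is exactly what the check in LabelBinarizer.fit rejects.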

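You need to give log_loss actual labels, not probabilities, for y_true (the ground truth). By the way, why do you have probabilities there? Probabilities make sense for predicted data, but why for the actual data?

To do this, you first need to convert the probabilities in y_true to labels, treating the class with the highest probability as the winner.

This can be done with numpy.argmax using the following code: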

import numpy as np
y_true = np.argmax(y_true, axis=1)

print(y_true)
Output:-  [1 0]
# We will not do the above for y_predict, because probabilities are allowed in it.

# We will use the labels param to declare that we actually have 3 classes,
# as evident from your probabilities.
metrics.log_loss(y_true, y_predict, labels=[0,1,2])

Output:-  0.6931471805599458
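The reported loss can be cross-checked by hand with NumPy alone. This is a sketch under the assumption that log loss for hard labels is the mean negative log of the probability assigned to each sample's true class; it ignores the tiny clipping and renormalisation that scikit-learn applies, and the names p_true and manual_loss are illustrative.

```python
import numpy as np

y_predict = np.array([[0.5, 0.5, 0.0], [0.5, 0.4, 0.1]])
labels = np.argmax(np.array([[0.2, 0.8, 0.0], [0.9, 0.05, 0.05]]), axis=1)

# For each row, pick the predicted probability of that row's label.
p_true = y_predict[np.arange(len(labels)), labels]

manual_loss = -np.mean(np.log(p_true))
print(labels, p_true, manual_loss)
```

Both samples assign probability 0.5 to their winning class, so the loss is ln 2, roughly 0.6931. Note that the argmax step discards the rest of the information in the soft targets, which is why this value differs from the one produced by a loss that uses the probabilities directly.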

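As discussed with @Imran, here is an example in which the y_true values are something other than 0 or 1.

This example uses log_loss for 3-class classification, where the values of y are 0, 1 and 2: http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_multiclass.html#sphx-glr-auto-examples-calibration-plot-calibration-multiclass-py

The example below just checks whether other values are allowed: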

y_true = np.array([0, 1, 2])
y_pred = np.array([[0.5,0.5,0.0],[0.5,0.4,0.1], [0.4,0.1,0.5]])
metrics.log_loss(y_true, y_pred)

Output:- approximately 0.7675   (No error)
