
How to interpret and investigate the perfect accuracy, precision, recall, F1, and AUC (which I don't trust) in unbalanced data

I have a highly imbalanced multi-label dataset.

Something unexpected came up in the results. As expected, with the logistic regression classifier, the higher-frequency labels achieved reasonable F1 and AUC scores (roughly 0.6-0.7), while the labels with less than 10% representation in the data got, also as expected, an F1 of 0 and an AUC of 0.5.

But when I ran the same thing with the SVC and Naive Bayes classifiers, some of these low-frequency labels (for example, a minor class with only 10 out of the 7000 samples) showed 100% accuracy, F1, precision, recall, and AUC, which I don't understand. I don't trust these perfect results given how few training samples are available. I also tried different random seeds to split the training and test sets and got the same results.
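
A minimal sketch of the per-label evaluation described above, assuming docs is the list of raw text samples and Y is a binary label-indicator matrix (e.g. from MultiLabelBinarizer); both names are placeholders rather than the actual variable names used:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

# docs: list of raw text samples, Y: (n_samples, n_labels) binary indicator
# matrix -- placeholders for the data described in the question.
X_train, X_test, Y_train, Y_test = train_test_split(
    docs, Y, test_size=0.2, random_state=0)  # same outcome with other seeds

clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('ovr', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
])
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)        # hard 0/1 predictions per label
Y_prob = clf.predict_proba(X_test)  # per-label probabilities for ROC AUC

# F1 and ROC AUC per label; the suspicious 1.0 scores appear here for some
# rare labels once the base estimator is swapped for SVC or MultinomialNB.
for j in range(Y.shape[1]):
    if len(np.unique(Y_test[:, j])) < 2:
        continue  # ROC AUC is undefined when the test split has only one class for this label
    print(j, f1_score(Y_test[:, j], Y_pred[:, j]),
          roc_auc_score(Y_test[:, j], Y_prob[:, j]))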

Classifiers

Logistic regression classifier
Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ..._state=None, solver='sag', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1))])

Naive Bayes classifier
Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...assifier(estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
          n_jobs=1))])

SVC classifier
Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...lti_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=1))])
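
The reprs above are truncated, so the exact hyperparameters are unknown, but they suggest a TF-IDF vectorizer feeding a one-vs-rest wrapper around each base estimator. A hedged reconstruction, with everything hidden behind the "..." left at scikit-learn defaults:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def tfidf_ovr(estimator):
    # TF-IDF features followed by one binary classifier per label,
    # matching the multi-label setup in the question.
    return Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', OneVsRestClassifier(estimator, n_jobs=1)),
    ])

logreg_pipeline = tfidf_ovr(LogisticRegression(solver='sag', tol=0.0001))
nb_pipeline = tfidf_ovr(MultinomialNB(alpha=1.0, fit_prior=True))
svc_pipeline = tfidf_ovr(LinearSVC(penalty='l2', tol=0.0001))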

To me, your result seems at least credible. Logistic regression tends toward a median characterization of the data, finding a single equation to characterize the differences among classes. Given the non-trivial quantity of data, it looks for the least-error fit for that equation.

SVC and Naive Bayes are much more sensitive to discernible boundaries, even ones far from the "mainstream" of the data. Those algorithms work more on an "us against the world" (a.k.a. "one versus all") view of each class. Thus, it doesn't surprise me that they can find a reasonable way to discriminate between a set of ten elements and "everything else".

Can you find a useful visualization tool to display the boundaries found by each method? If not, can you at least visualize the data set, with the observations color-coded by label? If you can see a distinct separation for a set of ten points, then I would expect SVC or Naive Bayes to find something comparable.
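
One rough way to do this, given that the features are TF-IDF vectors as in the pipelines above, is to project them to 2D and colour the points by one of the suspicious labels. A sketch, reusing the docs and Y placeholders from the earlier example (TruncatedSVD is chosen here because it handles the sparse TF-IDF matrix directly):

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Project the sparse TF-IDF matrix to 2D; label_idx picks one rare label.
X_tfidf = TfidfVectorizer().fit_transform(docs)
X_2d = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_tfidf)

label_idx = 0
mask = Y[:, label_idx] == 1
plt.scatter(X_2d[~mask, 0], X_2d[~mask, 1], s=5, alpha=0.3, label='other samples')
plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=40, c='red', label='rare label')
plt.legend()
plt.title('2D TruncatedSVD projection of TF-IDF features')
plt.show()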

Did you check how many samples these metrics were calculated on? If there were, e.g., only two test samples for a label, a 100% score is not that odd, given the low number of test samples.
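
This is easy to check by counting the positive test samples per label; Y_test is the same placeholder test-label matrix as in the earlier sketch:

import numpy as np

# Number of positive test samples for each label.
Y_arr = np.asarray(Y_test)
positives_per_label = Y_arr.sum(axis=0)
for j, n_pos in enumerate(positives_per_label):
    print(f"label {j}: {int(n_pos)} positive test samples out of {Y_arr.shape[0]}")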

Additionally, since you have imbalanced data, did you consider measures like balanced accuracy or the Matthews correlation coefficient (MCC) to gain insight into the predictive performance? Models can have a very high AUC while disregarding the minority class completely. If that coincides with, e.g., only majority-class samples for a label in the test set, it can also lead to these unexpected results.
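
Both measures are available in scikit-learn (balanced_accuracy_score requires version 0.20 or later). Computed per label on the Y_test / Y_pred placeholders from the earlier sketch:

from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

# Per-label balanced accuracy and MCC, which are much harder to max out
# on a rare label than plain accuracy.
for j in range(Y_test.shape[1]):
    bal_acc = balanced_accuracy_score(Y_test[:, j], Y_pred[:, j])
    mcc = matthews_corrcoef(Y_test[:, j], Y_pred[:, j])
    print(f"label {j}: balanced accuracy = {bal_acc:.3f}, MCC = {mcc:.3f}")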
