Perfect precision, recall and f1-score, but poor prediction

Question

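I am using scikit-learn to classify a binary problem. I get a perfect classification_report (all 1s), yet the prediction score comes out at 0.36. How is that possible?

I am familiar with label imbalance. But I don't think that is the cause here, because the f1 and other score columns, as well as the confusion matrix, all show perfect scores.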

import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# X, y: feature matrix and binary labels, loaded elsewhere.
# Set aside the last 19 rows for prediction.
X1, X_Pred, y1, y_Pred = train_test_split(X, y, test_size=19,
                                          shuffle=False, random_state=None)

X_train, X_test, y_train, y_test = train_test_split(X1, y1,
         test_size=0.4, stratify=y1, random_state=11)

clcv = DecisionTreeClassifier()
scorecv = cross_val_score(clcv, X1, y1, cv=StratifiedKFold(n_splits=4),
                          scoring='f1')  # to balance precision/recall
clcv.fit(X1, y1)
y_predict = clcv.predict(X1)  # predictions on the training data itself
cm = confusion_matrix(y1, y_predict)
cm_df = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])
print(cm_df)
print(classification_report(y1, y_predict))
print('Prediction score:', clcv.score(X_Pred, y_Pred))  # unseen data

Output:

confusion:
      0   1
0  3011   0
1     0  44

              precision    recall  f1-score   support
       False       1.00      1.00      1.00      3011
        True       1.00      1.00      1.00        44

   micro avg       1.00      1.00      1.00      3055
   macro avg       1.00      1.00      1.00      3055
weighted avg       1.00      1.00      1.00      3055

Prediction score: 0.36

Answer 1

The problem is that you are overfitting.

There is a lot of unused code, so let's trim it:

# Set aside the last 19 rows for prediction.
X1, X_Pred, y1, y_Pred = train_test_split(X, y, test_size= 19, 
                shuffle = False, random_state=None)

clcv = DecisionTreeClassifier()
clcv.fit(X1, y1)
y_predict = clcv.predict(X1)  # predictions on the same rows used for fitting
cm = confusion_matrix(y1, y_predict)
cm_df = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])
print(cm_df)
print(classification_report(y1, y_predict))
print('Prediction score:', clcv.score(X_Pred, y_Pred))  # unseen data

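Clearly, there is no cross-validation here: the confusion matrix and the classification report are computed on X1, the same data the tree was fitted on, so the perfect scores only show that the tree reproduces its training labels. The obvious reason for the lower prediction score on the unseen rows is overfitting of the decision tree classifier.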
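The gap between training and held-out scores can be reproduced with a small sketch. It does not use the asker's data: the synthetic dataset from make_classification, its class weights, and the label-noise setting are all assumptions chosen only to illustrate the effect.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the asker's X and y (an assumption).
X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                           weights=[0.95, 0.05], flip_y=0.05, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A tree with no depth limit keeps splitting until every training row
# is classified correctly.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # scored on the rows used for fitting
test_acc = tree.score(X_test, y_test)     # scored on rows the tree never saw
print('Training accuracy:', train_acc)
print('Held-out accuracy:', test_acc)
```

Only the held-out number says anything about how the model behaves on new rows; the training number is perfect almost by construction.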

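Look at the scores from cross-validation (scorecv in your code, which is computed but never printed) and you should see the problem right there.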
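A minimal sketch of that check, again on synthetic data rather than the asker's (the dataset and its parameters are assumptions; the names clcv, X1 and y1 simply mirror the question):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the asker's X1 and y1 (an assumption).
X1, y1 = make_classification(n_samples=3000, n_features=20, n_informative=5,
                             weights=[0.95, 0.05], flip_y=0.05, random_state=0)

clcv = DecisionTreeClassifier(random_state=0)

# Each fold is scored on rows that were held out while fitting that fold.
cv_f1 = cross_val_score(clcv, X1, y1, cv=StratifiedKFold(n_splits=4),
                        scoring='f1')
print('Cross-validated F1 per fold:', cv_f1)
print('Mean cross-validated F1:', cv_f1.mean())

# For comparison: F1 measured on the very rows the tree was fitted on.
clcv.fit(X1, y1)
train_f1 = f1_score(y1, clcv.predict(X1))
print('Training F1:', train_f1)
```

If the cross-validated F1 is well below the training F1, the perfect classification report reflects memorization rather than a model that generalizes.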
