Error when computing recall, precision, and F-score for 4 models using cross-validation?

Question

This is my code, which performs a classification. I would like to print the accuracy, recall, and precision of my 4 models using cross-validation. I tried and failed, because it always prints the metrics for a single set of data rather than the overall result. Do you have any idea how to do it?

I would also like to know whether, based on the confusion matrices, it is possible to compare the models and print which one fails to predict the right label for each set. @Nikaido, I tried your solution, but the precision and recall it reports do not correspond to the values I get when I compute them manually.



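import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# verbatim_train_remove_stop_words_lemmatize: the preprocessed sentences (defined earlier)
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(verbatim_train_remove_stop_words_lemmatize)
X = tfidf_vectorizer.transform(verbatim_train_remove_stop_words_lemmatize)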


total_verbatim = X.shape[0]
print(total_verbatim)
# Ground-truth labels, used to inspect the mislabelled and correctly labelled sentences
labels = np.zeros(total_verbatim)
# error with the configuration using the whole set
labels[:1316] = 0  # motivations (already 0 after np.zeros; kept for readability)
labels[1316:1891] = 1  # barriers ("freins")

verbatim_preprocess = np.array(verbatim_train_remove_stop_words_lemmatize)
# One row per sentence; a prediction column is added for each model below
df = pd.DataFrame(data={
    "id": np.arange(total_verbatim),
    "ground_true": labels,
    "original_sentence": verbatim_preprocess
   })

cv_splitter = KFold(n_splits=10, shuffle=False, random_state=None)
model1 = LinearSVC()
model2 = MultinomialNB()
model3 = LogisticRegression() #(random_state=0)
model4 = RandomForestClassifier()
models = [model1, model2, model3, model4]
for i, model in enumerate(models, start=1):
    # Out-of-fold predictions: each sentence is predicted by a model that did not train on it
    y_pred = cross_val_predict(model, X, labels, cv=cv_splitter)
    # Store this model's predictions in its own column (pred_model1 ... pred_model4)
    df["pred_model{}".format(i)] = y_pred
    print("Model: {}".format(model))
    print("matrice confusion: {}".format(confusion_matrix(labels, y_pred)))
    print("Accuracy: {}".format(accuracy_score(labels, y_pred)))
    print("Precision: {}".format(precision_score(labels, y_pred)))
    print("Recall: {}".format(recall_score(labels, y_pred)))
    print("F mesure: {}".format(f1_score(labels, y_pred)))


df.to_excel("EXIT.xlsx")

I get this result:


Model: LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Accuracy: 0.5393971443680592
Precision: 0.13902439024390245
Recall: 0.09913043478260869
F mesure: 0.11573604060913706
matrice confusion: [[963 353]
 [518  57]]
Model: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Accuracy: 0.6604970914859862
Precision: 0.014492753623188406
Recall: 0.0017391304347826088
F mesure: 0.0031055900621118015
matrice confusion: [[1248   68]
 [ 574    1]]

If I compute the precision and recall manually for the first model:

For the SVM: Precision: 963/(963+353) = 0.73, Recall: 963/(963+518) = 0.65

How is that possible? Is my code wrong somewhere?

Answer 1

Sklearn offers a lot of tools for cross-validated evaluation of different models. This task can be done in different ways. One that I thought of is:

from sklearn import datasets
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold
# toy problem
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
cv_splitter = KFold(n_splits=10, shuffle=False, random_state=None)
model1 = LinearSVC()
model2 = MultinomialNB()
model3 = LogisticRegression() #(random_state=0)
model4 = RandomForestClassifier()
models = [model1, model2, model3, model4]
for model in models:
    # Out-of-fold predictions for every sample, then metrics on the full set
    y_pred = cross_val_predict(model, X, y, cv=cv_splitter)
    print("Model: {}".format(model))
    print("Accuracy: {}".format(accuracy_score(y, y_pred)))
    print("Precision: {}".format(precision_score(y, y_pred)))
    print("Recall: {}".format(recall_score(y, y_pred)))

Basically, I used:

a splitter for the CV split (KFold), so that every model is evaluated on the same fixed folds;
the cross_val_predict method, which returns a prediction for every sample, each made while that sample was in a test fold of the cross-validation split;
the metric functions (accuracy, precision, recall), applied once to the full set of predictions for each model.
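One thing that may be worth checking about the splitter: with shuffle=False, KFold takes consecutive blocks of samples as test folds. The sketch below does not call scikit-learn; it re-implements the documented fold-size rule in plain Python (the helper name contiguous_folds is hypothetical) and applies it to the label layout from the question, 1316 samples of class 0 followed by 575 of class 1. Under that assumption, most test folds contain a single class, which could contribute to the low scores.

```python
def contiguous_folds(n_samples, n_splits):
    """Return (start, stop) pairs for consecutive folds.

    Mirrors the fold-size rule documented for KFold without shuffling:
    the first n_samples % n_splits folds get one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for k in range(n_splits):
        stop = start + base + (1 if k < extra else 0)
        folds.append((start, stop))
        start = stop
    return folds


# Label layout assumed from the question: class 0 first, then class 1.
labels = [0] * 1316 + [1] * 575

# Count how many samples of each class land in every test fold.
fold_counts = []
for start, stop in contiguous_folds(len(labels), 10):
    test = labels[start:stop]
    fold_counts.append((test.count(0), test.count(1)))

for k, (n0, n1) in enumerate(fold_counts):
    print("fold {}: class 0 = {}, class 1 = {}".format(k, n0, n1))
```

If the data really is ordered by class like this, KFold(shuffle=True, random_state=0) or StratifiedKFold are the usual ways to get folds that mix both classes.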

With the predicted labels for every model (y_pred in the for loop), you can then do the comparison you need.
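As a rough illustration of such a comparison, the sketch below lists, for every sample, which models predicted the wrong label. The labels and predictions are made-up toy values, and failing_models is a hypothetical helper name; in practice the dictionary would hold the y_pred array obtained for each model.

```python
# Toy ground truth and per-model predictions (illustrative values only).
ground_truth = [0, 0, 1, 1, 0]
predictions = {
    "LinearSVC": [0, 1, 1, 0, 0],
    "MultinomialNB": [0, 0, 0, 0, 0],
}


def failing_models(ground_truth, predictions):
    """For each sample, return the names of the models that mispredicted it."""
    return [
        [name for name, y_pred in predictions.items() if y_pred[i] != truth]
        for i, truth in enumerate(ground_truth)
    ]


failures = failing_models(ground_truth, predictions)
for i, names in enumerate(failures):
    print("sample {}: wrong models = {}".format(i, names))
```

Storing each model's predictions in its own DataFrame column next to the ground truth gives the same information in a form that can be exported to Excel.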

Details on the cross_val_predict method are available in the scikit-learn documentation.
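Regarding the manual check in the question: precision_score and recall_score treat label 1 as the positive class by default, and confusion_matrix puts true labels in rows and predicted labels in columns. The sketch below recomputes both metrics from the LinearSVC confusion matrix printed in the question, once per class (precision_recall is a hypothetical helper name). Under those conventions, the reported 0.139 and 0.099 are the precision and recall of class 1, while 0.73 and 0.65 are the recall and precision of class 0.

```python
# Confusion matrix printed for LinearSVC in the question.
# Rows are true labels, columns are predicted labels.
cm = [[963, 353],
      [518, 57]]


def precision_recall(cm, positive):
    """Precision and recall when class `positive` is the positive class."""
    tp = cm[positive][positive]
    predicted_positive = sum(row[positive] for row in cm)  # column sum
    actual_positive = sum(cm[positive])                    # row sum
    return tp / predicted_positive, tp / actual_positive


p1, r1 = precision_recall(cm, positive=1)
p0, r0 = precision_recall(cm, positive=0)
print("class 1: precision = {:.3f}, recall = {:.3f}".format(p1, r1))
print("class 0: precision = {:.3f}, recall = {:.3f}".format(p0, r0))
```

To have scikit-learn report the class 0 values instead, precision_score and recall_score accept pos_label=0, and classification_report prints the metrics for every class.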

Solution 1
0 2019-11-08 10:36:39
