
Error when computing recall, precision, and F-score of 4 models using cross-validation?

This is my code, which performs a classification. I would like to print the accuracy, recall, and precision of my 4 models using cross-validation. I tried and failed because it always prints the metrics for a single subset of the data and not the overall result. Do you have any idea how to do it?

I would also like to know whether, based on my confusion matrix, it is possible to compare the models, so as to print which one fails to predict the right label for each set. So @Nikaido, I tried your solution, but the precision and recall do not correspond to the values I get when I compute them manually.



import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
from sklearn.model_selection import KFold, cross_val_predict

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(verbatim_train_remove_stop_words_lemmatize)
X = tfidf_vectorizer.transform(verbatim_train_remove_stop_words_lemmatize)

total_verbatim = X.shape[0]
print(total_verbatim)

# Create the label array; inspect the badly + well labelled samples
# (note: error with the configuration on the whole set)
labels = np.zeros(total_verbatim)
labels[:1316] = 0      # motivations
labels[1316:1891] = 1  # freins

# One row per (sample, model) with the cross-validated prediction
df = pd.DataFrame(columns=["id", "ground_truth", "original_sentence",
                           "model", "prediction"])

cv_splitter = KFold(n_splits=10, shuffle=False, random_state=None)
model1 = LinearSVC()
model2 = MultinomialNB()
model3 = LogisticRegression()  # (random_state=0)
model4 = RandomForestClassifier()
models = [model1, model2, model3, model4]
verbatim_preprocess = np.array(verbatim_train_remove_stop_words_lemmatize)
for model in models:
    # Out-of-fold prediction for every sample
    y_pred = cross_val_predict(model, X, labels, cv=cv_splitter)
    temp_df = pd.DataFrame(data={"id": np.arange(total_verbatim),
                                 "ground_truth": labels,
                                 "original_sentence": verbatim_preprocess,
                                 "model": type(model).__name__,
                                 "prediction": y_pred})
    df = pd.concat([df, temp_df])
    print("Model: {}".format(model))
    print("Accuracy: {}".format(accuracy_score(labels, y_pred)))
    print("Precision: {}".format(precision_score(labels, y_pred)))
    print("Recall: {}".format(recall_score(labels, y_pred)))
    print("F measure: {}".format(f1_score(labels, y_pred)))
    print("Confusion matrix: {}".format(confusion_matrix(labels, y_pred)))

df.to_excel("EXIT.xlsx")

I get this result:


Model: LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Accuracy: 0.5393971443680592
Precision: 0.13902439024390245
Recall: 0.09913043478260869
F measure: 0.11573604060913706
Confusion matrix: [[963 353]
 [518  57]]
Model: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Accuracy: 0.6604970914859862
Precision: 0.014492753623188406
Recall: 0.0017391304347826088
F measure: 0.0031055900621118015
Confusion matrix: [[1248   68]
 [ 574    1]]

If I compute the precision manually for the first model:

For SVM: Precision = 963/(963+353) = 0.73, Recall = 963/(963+518) = 0.65

How? Is my code wrong somewhere?

Sklearn offers a lot of tools for cross-validation estimation on different models. This task can be done in different ways. One I thought of is:

from sklearn import datasets
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold

# toy problem
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

# fixed splitter, so every model is evaluated on the same folds
cv_splitter = KFold(n_splits=10, shuffle=False, random_state=None)
model1 = LinearSVC()
model2 = MultinomialNB()
model3 = LogisticRegression()  # (random_state=0)
model4 = RandomForestClassifier()
models = [model1, model2, model3, model4]
for model in models:
    # one out-of-fold prediction per sample, over the whole dataset
    y_pred = cross_val_predict(model, X, y, cv=cv_splitter)
    print("Accuracy: {}".format(accuracy_score(y, y_pred)))
    print("Precision: {}".format(precision_score(y, y_pred)))
    print("Recall: {}".format(recall_score(y, y_pred)))

Basically I used:

  • a splitter for the CV split (KFold), so that every model is evaluated on the same fixed folds;
  • the cross_val_predict method, which yields one out-of-fold prediction for every sample across the test folds of the cross-validation split, so the metrics are computed once over the whole dataset rather than per fold;
  • and then, from every model's predictions, I extracted the specified metrics (accuracy, precision, recall); a per-fold alternative is sketched right after this list.
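
If you would rather have per-fold scores (and their average) than one pooled score, here is a minimal sketch, assuming the same toy problem and cv_splitter as above, using sklearn's cross_validate with several scorers:

from sklearn.model_selection import cross_validate

# Per-fold scores for several metrics at once; each "test_<metric>"
# entry of the result is an array with one score per CV fold.
scores = cross_validate(model1, X, y, cv=cv_splitter,
                        scoring=["accuracy", "precision", "recall", "f1"])
print(scores["test_precision"])         # 10 per-fold precisions
print(scores["test_precision"].mean())  # their average

Note that the average of per-fold precisions is in general not equal to the pooled precision computed from the cross_val_predict output.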

Having the prediction labels for every model (y_pred in the for loop), you can then do the comparison you need.
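
For example, a minimal sketch (continuing from the toy problem above; the column names are just illustrative) that collects each model's out-of-fold predictions and flags the samples each model gets wrong:

import pandas as pd

# One column of CV predictions per model, plus the ground truth
preds = {type(m).__name__: cross_val_predict(m, X, y, cv=cv_splitter)
         for m in models}
comparison = pd.DataFrame(preds)
comparison["ground_truth"] = y

# True where a model misses the true label
wrong = comparison.drop(columns="ground_truth").ne(comparison["ground_truth"], axis=0)
print(wrong.sum())                    # number of errors per model
print(comparison[wrong.any(axis=1)])  # samples at least one model gets wrong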

Details for the cross_val_predict method here
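
As for the mismatch with your manual computation: precision_score and recall_score default to pos_label=1, so the printed values are the metrics of class 1 (freins), while 963/(963+518) ≈ 0.65 and 963/(963+353) ≈ 0.73 are the precision and recall of class 0 (note the two are swapped in the question). A minimal check, reconstructing the LinearSVC predictions from the confusion matrix in your output (rows are true labels, columns are predictions):

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Rebuild predictions matching the reported confusion matrix
# [[963 353]
#  [518  57]]
y_true = np.array([0] * 1316 + [1] * 575)
y_pred = np.array([0] * 963 + [1] * 353 + [0] * 518 + [1] * 57)

print(precision_score(y_true, y_pred))               # 57/(353+57)   = 0.139...
print(recall_score(y_true, y_pred))                  # 57/(518+57)   = 0.099...
print(precision_score(y_true, y_pred, pos_label=0))  # 963/(963+518) = 0.650...
print(recall_score(y_true, y_pred, pos_label=0))     # 963/(963+353) = 0.731...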
