
Should my model always give 100% accuracy on Training dataset?

# Multinomial Naive Bayes on Lemmatized Text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# df is my dataframe with the lemmatized text and the product labels
X_train, X_test, y_train, y_test = train_test_split(df['Rejoined_Lemmatize'], df['Product'], random_state=0)

tfidf = TfidfVectorizer()
X_train_counts = tfidf.fit_transform(X_train)      # fit the vectorizer on the training text
clf = MultinomialNB().fit(X_train_counts, y_train)
y_temp = clf.predict(tfidf.transform(X_train))     # predict on the training data itself

I am testing my model on the training dataset itself. It is giving me the following results:

                          precision    recall  f1-score   support

               accuracy                           0.92    742500
              macro avg       0.93      0.92      0.92    742500
           weighted avg       0.93      0.92      0.92    742500

Is it acceptable to get accuracy < 100% on the training dataset?

No, you should not expect 100% accuracy on your training dataset. If your model does reach 100%, it could mean that it is overfitting.
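One quick way to check this (a minimal sketch reusing the tfidf vectorizer, the fitted clf and the train/test split from the question above) is to compare the accuracy on the training data with the accuracy on the held-out test split:

from sklearn.metrics import accuracy_score

# accuracy on the data the model was fitted on
train_acc = accuracy_score(y_train, clf.predict(tfidf.transform(X_train)))
# accuracy on the held-out split the model never saw during fitting
test_acc = accuracy_score(y_test, clf.predict(tfidf.transform(X_test)))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

A large gap between the two numbers (e.g. 1.00 on train against a much lower test score) is the typical sign of overfitting.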

TL;DR: yes, it is acceptable for the training accuracy to be below 100%; what matters is the performance on the testing dataset.

The most important question in classification (supervised learning) is generalization, that is to say the performance in production (or on the testing dataset). The performance on your training dataset does not really matter, since that data is only used to fit the model. Once training is done, the model will only ever be applied to data it has not seen during learning.

A statistical model that is complex enough (that has enough capacity) can fit any training dataset perfectly and reach 100% accuracy on it. But by fitting the training set perfectly, it will perform poorly on new data that was not seen during training (overfitting). That is not what you are interested in. You can therefore accept lower performance on the training dataset in order to generalize better, that is to say to get better performance on data not used during learning. This is called regularization.

In your case, I am nevertheless not sure that MultinomialNB lets you control the regularization. You could try other sklearn classifiers, such as those proposed here.
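As an illustration (not part of the original answer), here is how a regularization strength could be swept with sklearn's LogisticRegression, which exposes it directly through the inverse-regularization parameter C; it reuses X_train_counts, y_train and the TF-IDF-transformed test split from above:

from sklearn.linear_model import LogisticRegression

# smaller C = stronger regularization; watch how train and test accuracy move
for C in (0.01, 0.1, 1.0, 10.0):
    lr = LogisticRegression(C=C, max_iter=1000).fit(X_train_counts, y_train)
    print(f"C={C}: train={lr.score(X_train_counts, y_train):.3f}, "
          f"test={lr.score(tfidf.transform(X_test), y_test):.3f}")

Typically, stronger regularization lowers the training accuracy but can improve (or at least stabilize) the test accuracy, which is exactly the trade-off described above.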

I think it is better to use cross-validation to get an accurate estimate of your accuracy. Cross-validation is considered an efficient way to guard against overfitting.

from sklearn.model_selection import cross_val_score

# cross-validate on the TF-IDF features; cross_val_score clones clf and re-fits it on each fold
scores = cross_val_score(clf, X_train_counts, y_train, cv=10)

You can then report the mean score: scores.mean().
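A slightly more careful variant (a sketch, not from the original answer) wraps the vectorizer and the classifier in a Pipeline, so the TF-IDF vocabulary is re-learned inside every fold instead of being fitted once on the whole training set:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# the pipeline re-fits the vectorizer on each training fold, avoiding leakage into the validation fold
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, X_train, y_train, cv=10)
print(scores.mean(), scores.std())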
