Should my model always give 100% accuracy on the training dataset?
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB  # Multinomial Naive Bayes on lemmatized text

X_train, X_test, y_train, y_test = train_test_split(df['Rejoined_Lemmatize'], df['Product'], random_state=0)
tfidf = TfidfVectorizer()
X_train_counts = tfidf.fit_transform(X_train)
clf = MultinomialNB().fit(X_train_counts, y_train)
y_temp = clf.predict(tfidf.transform(X_train))
I am testing my model on the training dataset itself. It is giving me the following results:
              precision    recall  f1-score   support

    accuracy                           0.92    742500
   macro avg       0.93      0.92     0.92    742500
weighted avg       0.93      0.92     0.92    742500
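The table above looks like the output of scikit-learn's `classification_report`. For reference, a minimal way to produce such a report (the labels below are toy placeholders, not the asker's `y_train` / `y_temp`):

```python
from sklearn.metrics import classification_report

# Toy ground-truth and predictions standing in for y_train and y_temp above.
y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "b"]

# Prints per-class precision/recall/f1 plus the accuracy, macro avg and
# weighted avg rows seen in the question.
print(classification_report(y_true, y_pred))
```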
Is it acceptable to get accuracy < 100% on the training dataset?
No, you should not expect 100% accuracy on your training dataset. If you get it, it could mean that your model is overfitting.
TL;DR: yes, it is acceptable to have better performance on the testing dataset.
The most important question in classification (supervised learning) is that of generalization, that is to say the performance in production (or on the testing dataset). Actually, the performance on your learning dataset does not matter much, since that data is only used to fit your model. Once training is done, you will never use it again: only data that has not been seen during learning will be submitted to the model.
A statistical model that is complex enough (that has enough capacity) can fit any learning dataset perfectly and obtain 100% accuracy on it. But by fitting the training set perfectly, it will perform poorly on new data not seen during training (overfitting). Hence, that is not what interests you. You can therefore accept reduced performance on the training dataset in order to generalize better, that is to say, to get better performance on data not used during learning. This is called regularization.
In your case, I am nevertheless not sure that MultinomialNB allows you to control the regularization. You should try other sklearn classifiers, such as those proposed here.
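For instance (a sketch with a made-up toy corpus, not the asker's dataframe), other sklearn text classifiers such as `LinearSVC` or `LogisticRegression` expose a `C` parameter that controls regularization strength, and can be swapped in behind the same TF-IDF features via a pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus and labels; placeholders for df['Rejoined_Lemmatize'] / df['Product'].
texts = ["cheap loan offer", "meeting at noon", "win money now", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

# Both classifiers accept C: smaller C = stronger regularization.
for model in (LinearSVC(C=1.0), LogisticRegression(C=1.0, max_iter=1000)):
    pipe = make_pipeline(TfidfVectorizer(), model).fit(texts, labels)
    print(type(model).__name__, pipe.predict(["free money offer"]))
```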
I think it is better to use the cross-validation result to get an accurate estimate of your accuracy. Cross-validation is taken to be an efficient way to avoid overfitting.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Cross-validate on the raw text via a pipeline, so TF-IDF is refit in each fold.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, X_train, y_train, cv=10)
And you can report the mean score: scores.mean().
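Putting the above together, here is a self-contained end-to-end sketch (the corpus and labels are invented placeholders for `df['Rejoined_Lemmatize']` and `df['Product']`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder corpus standing in for the asker's lemmatized text and labels.
texts = ["good phone battery", "bad laptop screen", "great phone camera",
         "broken laptop keyboard", "phone charges fast", "laptop fan is loud"] * 5
labels = ["phone", "laptop", "phone", "laptop", "phone", "laptop"] * 5

# The pipeline refits TF-IDF inside each fold, avoiding leakage from
# vectorizing the full dataset before splitting.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, texts, labels, cv=5)
print(f"mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

The mean over folds is a more honest estimate of generalization than accuracy measured on the training set itself.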