Test accuracy always high regardless of how small my training set is

I am working on a project where I try to classify comments into various categories: "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate". The dataset I am using is from this Kaggle challenge: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge. The issue I am facing is that no matter how small a training set I fit on, my accuracy when predicting labels for the test data is always around or above 90%. In this case I am training on 15 rows of data and testing on 159,556 rows. I would normally be excited to have a high test accuracy, but in this case I suspect I am doing something wrong.

I read the data into a pandas DataFrame:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

trainData = pd.read_csv('train.csv')

Here is what the data looks like when printed:

                      id                                       comment_text  \
0       0000997932d777bf  Explanation\nWhy the edits made under my usern...   
1       000103f0d9cfb60f  D'aww! He matches this background colour I'm s...   
2       000113f07ec002fd  Hey man, I'm really not trying to edit war. It...   
3       0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...   
4       0001d958c54c6e35  You, sir, are my hero. Any chance you remember...   
...                  ...                                                ...   
159566  ffe987279560d7ff  ":::::And for the second time of asking, when ...   
159567  ffea4adeee384e90  You should be ashamed of yourself \n\nThat is ...   
159568  ffee36eab5c267c9  Spitzer \n\nUmm, theres no actual article for ...   
159569  fff125370e4aaaf3  And it looks like it was actually you who put ...   
159570  fff46fc426af1f9a  "\nAnd ... I really don't think you understand...   

        toxic  severe_toxic  obscene  threat  insult  identity_hate  
0           0             0        0       0       0              0  
1           0             0        0       0       0              0  
2           0             0        0       0       0              0  
3           0             0        0       0       0              0  
4           0             0        0       0       0              0  
...       ...           ...      ...     ...     ...            ...  
159566      0             0        0       0       0              0  
159567      0             0        0       0       0              0  
159568      0             0        0       0       0              0  
159569      0             0        0       0       0              0  
159570      0             0        0       0       0              0  

[159571 rows x 8 columns]

Then I split the data into train and test sets using train_test_split:

X = trainData.drop(labels= ['id','toxic','severe_toxic','obscene','threat','insult','identity_hate'],axis=1)
Y = trainData.drop(labels = ['id','comment_text'],axis=1)

trainX,testX,trainY,testY = train_test_split(X,Y,test_size=0.9999,random_state=99)
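(As an aside: with a split this extreme it is worth checking how many positive examples actually land in the 15 training rows. A minimal sketch with synthetic stand-in data, since the actual Kaggle CSV is not reproduced here; only the ~10% positive rate and the column names are taken from the question:)

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle labels: 1000 rows, ~10% "toxic"
df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(1000)],
    "toxic": [1 if i % 10 == 0 else 0 for i in range(1000)],
})

# An extreme split leaving only 15 training rows, as in the question
trainX, testX, trainY, testY = train_test_split(
    df[["comment_text"]], df[["toxic"]], test_size=0.985, random_state=99
)

# With so few rows, the rare positive class is almost absent from training
print(trainY["toxic"].value_counts())
```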

I am using sklearn's HashingVectorizer to convert the comments into numerical vectors for classification:

def hashVec():
    # Pull the raw comment strings out of the train/test frames
    trainComments = trainX['comment_text'].tolist()
    testComments = testX['comment_text'].tolist()
    # HashingVectorizer is stateless, so transform() can be called directly
    vectorizer = HashingVectorizer()
    trainSamples = vectorizer.transform(trainComments)
    testSamples = vectorizer.transform(testComments)
    return trainSamples, testSamples

I am using OneVsRestClassifier and LogisticRegression from sklearn to fit and predict each of the 6 classes:

def logRegOVR(trainSamples, testSamples):
    commentTypes = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    clf = OneVsRestClassifier(LogisticRegression(solver='sag'))
    # Fit and evaluate one binary classifier per label column
    for cType in commentTypes:
        print(cType, ":")
        clf.fit(trainSamples, trainY[cType])
        pred1 = clf.predict(trainSamples)
        print("\tTrain Accuracy:", accuracy_score(trainY[cType], pred1))
        prediction = clf.predict(testSamples)
        print("\tTest Accuracy:", accuracy_score(testY[cType], prediction))

Finally, here is where I call the functions, and the output I get:

sol = hashVec()
logRegOVR(sol[0],sol[1])
toxic :
    Train Accuracy: 0.8666666666666667
    Test Accuracy: 0.9041590413397177
severe_toxic :
    Train Accuracy: 1.0
    Test Accuracy: 0.9900035097395272
obscene :
    Train Accuracy: 1.0
    Test Accuracy: 0.9470468048835519
threat :
    Train Accuracy: 1.0
    Test Accuracy: 0.9970041866178646
insult :
    Train Accuracy: 1.0
    Test Accuracy: 0.9506317531148938
identity_hate :
    Train Accuracy: 1.0
    Test Accuracy: 0.9911943142219659

The test accuracy is very similar when I use a more reasonable train_test_split of 80% training and 20% testing.

Thank you for the assistance.

You are not using a good metric: accuracy is not a good way to determine whether you are doing well here. I recommend looking at the F1 score, which combines precision and recall; I find it much more relevant for evaluating how a classifier is actually performing.
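(For illustration, a minimal sketch of how F1 exposes what accuracy hides on a skewed label; the labels here are made up, not taken from the question's data:)

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true/predicted labels for one class, e.g. "toxic":
# 2 positives out of 10, and the model catches only 1 of them
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Accuracy looks fine (9/10 correct)...
print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.9
# ...but F1 penalizes the missed positive (precision 1.0, recall 0.5)
print("F1:", f1_score(y_true, y_pred))
```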

If the dataset is imbalanced, accuracy does not mean a thing. If 90% of the comments in your dataset do not fall into any of those 'toxic' categories, and the model always predicts that a comment is 'clean', you still get your 90% accuracy.
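(This is easy to reproduce: a classifier that always predicts the majority class scores exactly the prevalence of that class. A minimal sketch on synthetic labels with the ~10% positive rate described above:)

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels mimicking the imbalance: roughly 10% positive
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.10).astype(int)

# A "model" that always predicts the clean/negative class
y_pred = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_pred))        # ~0.90
print("F1:", f1_score(y_true, y_pred, zero_division=0))   # 0.0
```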
