Understanding accuracy_score with scikit-learn with my own corpus?

Suppose I have already done some text classification with scikit-learn using SVC. First I vectorized the corpus, then I split the data into training and test sets, and then I assigned the labels to the training set. Now I would like to obtain the accuracy of the classification.

From the documentation I read the following:

>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2

The problem is that I don't understand what y_pred = [0, 2, 1, 3] and y_true = [0, 1, 2, 3] are, or how I can obtain these values once I have classified the test set of my own corpus. Could anybody help me with this issue?

Let's take the following as an example:

trainingdata:

Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.
De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.
Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?
Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.

testdata:

Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje
Estima-se que o mercado homossexual só na Cidade do México movimente cerca de oito mil milhões de dólares, aproximadamente seis mil milhões de euros


import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data: each line of the training file is one document.
word_vectorizer = CountVectorizer(analyzer='word')
with codecs.open(trainfile, 'r', 'utf8') as f:
    trainset = word_vectorizer.fit_transform(f)
# One tag per training document, in file order.
tags = ['bs', 'pt', 'es', 'sr']

# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Tagging the documents (reuse the vocabulary learned from the training set)
with codecs.open(testfile, 'r', 'utf8') as f:
    testset = word_vectorizer.transform(f)
results = mnb.predict(testset)

print(results)

There is a small error in your example. The line:

tags = ['SPAM','HAM','another_class']

is wrong. There should be a tag for each example (sentence/document) in your corpus, so tags should contain not 3 entries but as many entries as trainset has documents.

The same applies to the test set. You should have a variable test_tags that is the same length as testset. These tags are normally a column inside the file 'test.txt', but you might get them from somewhere else. This would be your y_true.

When you predict on the test set you will get a vector of the same length as testset:
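As a rough sketch of the "column inside the file" case: assume, purely for illustration, that each line of the test file holds a tag and a document separated by a tab. That layout and the helper name read_labeled_lines are hypothetical, not something taken from the question.

```python
import io


def read_labeled_lines(lines):
    """Split 'tag<TAB>document' lines into parallel lists (hypothetical format)."""
    tags, docs = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines so tags and docs stay aligned
        tag, doc = line.split("\t", 1)
        tags.append(tag)
        docs.append(doc)
    return tags, docs


# Stand-in for an opened 'test.txt'; a real file object would work the same way.
sample = io.StringIO(
    "es\tPor ello, ha insistido en que Europa tiene que darle un toque\n"
    "pt\tEstima-se que o mercado homossexual movimente cerca de oito mil\n"
)
test_tags, test_docs = read_labeled_lines(sample)
print(test_tags)       # ['es', 'pt']
print(len(test_docs))  # 2
```

The point is only that test_tags and the documents come out in the same order, so that position i of y_true describes the same document as position i of the predictions.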

results = mnb.predict(testset)

i.e. a tag prediction for each example in your test set.

This is your y_pred. I omitted some details related to the multiclass vs. single-class case (material for another question), but this should answer your question.
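Putting the pieces together, a minimal sketch might look like the following. The toy documents and their tags are made up for illustration (they are not the corpus from the question), and in-memory lists stand in for the train and test files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (assumed for illustration): one tag per training document.
train_docs = [
    "el gato come pescado",
    "la casa es grande",
    "o gato come peixe",
    "a casa fica longe",
]
train_tags = ["es", "es", "pt", "pt"]

test_docs = ["el gato es grande", "o peixe fica longe"]
test_tags = ["es", "pt"]  # y_true: the known labels of the test documents

vectorizer = CountVectorizer(analyzer="word")
trainset = vectorizer.fit_transform(train_docs)
testset = vectorizer.transform(test_docs)  # reuse the training vocabulary

mnb = MultinomialNB()
mnb.fit(trainset, train_tags)

results = mnb.predict(testset)  # y_pred: one predicted tag per test document
print(results)
print(accuracy_score(test_tags, results))
```

Here test_tags plays the role of y_true and results plays the role of y_pred from the documentation example.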

I hope this helps. You asked:

The problem is that I don't understand what y_pred = [0, 2, 1, 3] and y_true = [0, 1, 2, 3] are, or how I can obtain these values once I have classified the test set of my own corpus. Could anybody help me with this issue?

Answer: As you know, a classifier is supposed to assign data to different classes. In the example above, the data has four distinct classes, designated by the labels 0, 1, 2, and 3. So, if the task were classifying colors in single-colored images, the labels would represent something like blue, red, yellow, and green. The example also shows that there were only four samples in the data. For example, there were only four images; y_true shows their real labels (the ground truth), and y_pred shows the classifier's predictions. If the two lists were identical, the accuracy would be 100%. In this case, however, two of the predicted labels don't match their ground truth, so the accuracy is 2/4 = 0.5.

Now, in your sample code, you have written:
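To make the comparison concrete, here is a small sketch that reproduces the two numbers from the documentation example by hand, without calling accuracy_score. The variable names n_correct and accuracy are just illustrative.

```python
y_true = [0, 1, 2, 3]  # ground-truth labels
y_pred = [0, 2, 1, 3]  # labels predicted by the classifier

# Count the positions where prediction and ground truth agree.
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = n_correct / len(y_true)

print(n_correct)  # 2, the value accuracy_score(..., normalize=False) reports
print(accuracy)   # 0.5, the value accuracy_score(y_true, y_pred) reports
```

Only positions 0 and 3 match, which is where the 2 and the 0.5 come from.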

tags = ['SPAM','HAM','another_class']

which, as I explained above, means first of all that your data consists of 3 different classes, and secondly that your data consists of only 3 samples (which is probably not what you actually wanted). The length of this list should therefore be equal to the number of samples in your training data. Let me know if you have further questions.
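A quick way to see the length requirement, sketched with made-up documents: if the number of tags differs from the number of vectorized documents, fit refuses the input with a ValueError.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Four toy documents (assumed for illustration) but only three tags.
docs = ["spam spam offer", "hello friend", "cheap offer now", "see you soon"]
tags = ["SPAM", "HAM", "another_class"]

trainset = CountVectorizer().fit_transform(docs)
print(trainset.shape[0], len(tags))  # 4 3

try:
    MultinomialNB().fit(trainset, tags)
    fit_failed = False
except ValueError as err:
    # scikit-learn checks that X and y have the same number of samples.
    fit_failed = True
    print("fit rejected the labels:", err)
```

Comparing trainset.shape[0] with len(tags) before calling fit is a cheap sanity check for this kind of mismatch.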

