
Understanding accuracy_score with scikit-learn with my own corpus?

Suppose that I have already done some text classification with scikit-learn using SVC. First I vectorized the corpus, then I split the data into test and train sets, and then I set the labels for the train set. Now I would like to obtain the accuracy of the classification.

From the documentation I read the following:

>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2

The problem is I don't understand what y_pred = [0, 2, 1, 3] and y_true = [0, 1, 2, 3] are, and how I can "reach" or obtain these values once I have classified the test set of my own corpus. Could anybody help me with this issue?

Let's take the following as an example:

training data:

Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.
De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.
Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?
Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.

test data:

Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje
Estima-se que o mercado homossexual só na Cidade do México movimente cerca de oito mil milhões de dólares, aproximadamente seis mil milhões de euros


import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing the data: one document per line of train.txt.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sr']  # one language tag per training document

# Training the Naive Bayes classifier.
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Tagging the test documents.
testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
results = mnb.predict(testset)

print(results)

There is a small error in your example. The line:

tags = ['SPAM','HAM','another_class']

is wrong. There should be a tag for each example (sentence/document) in your corpus, so tags should not have length 3 but the same length as your trainset.

The same applies to the test set. You should have a variable test_tags that is the same length as testset. These tags are normally a column inside the file 'test.txt', but you might get them from somewhere else. This would be your y_true.
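
For instance, here is a minimal sketch of building y_true, assuming the gold tags for 'test.txt' are kept in a separate, hypothetical file 'test_labels.txt' with one tag per line, in the same order as the test sentences:

import codecs

# Hypothetical file: one language tag per line, aligned with 'test.txt',
# e.g. 'es' and 'pt' for the two test sentences in the question.
with codecs.open('test_labels.txt', 'r', 'utf8') as f:
    test_tags = [line.strip() for line in f if line.strip()]

y_true = test_tags  # one gold tag per test document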

When you predict on the test set, you will get a vector of the same length as testset:

results = mnb.predict(testset)

i.e. a tag prediction for each example in your test set.

This is your y_pred. I omitted some details related to the multiclass vs. single-class case (material for another question), but this should answer your question.

I hope this helps. You asked:

The problem is I don't understand what y_pred = [0, 2, 1, 3] and y_true = [0, 1, 2, 3] are, and how I can "reach" or obtain these values once I have classified the test set of my own corpus. Could anybody help me with this issue?

Answer: As you know, a classifier is supposed to assign data to different classes. In the example above, the data has four distinct classes, designated with the labels 0, 1, 2, and 3. So if our data were about classifying colors in single-colored images, the labels would represent something like blue, red, yellow, and green. The other thing the example shows is that there are only four samples in the data: for example, there were only four images, and y_true shows their real labels (the ground truth), while y_pred shows the classifier's predictions. Now, if we compare the two lists and both are identical, we have an accuracy of 100%; in this case, however, you can see that two of the predicted labels don't match their ground truth.
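
To make that comparison concrete, here is a small sketch that recomputes the documentation example by hand and with accuracy_score:

from sklearn.metrics import accuracy_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]

# Count the positions where the prediction matches the ground truth.
matches = sum(1 for p, t in zip(y_pred, y_true) if p == t)  # 2 of 4 match

print(matches / float(len(y_true)))    # 0.5, computed by hand
print(accuracy_score(y_true, y_pred))  # 0.5, the same value from sklearn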

Now, in your sample code, you have written:

tags = ['SPAM','HAM','another_class']

which, as I explained above, means that, first of all, your data consists of 3 different classes; and secondly, it implies that your data consists of only 3 samples (which is probably not what you actually wanted). Thus, the length of this list should be equal to the number of samples in your training data. Let me know if you have further questions.
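
As a sketch of what correctly sized label lists would look like for the corpus in the question (assuming the two test sentences are Spanish and Portuguese respectively):

# One tag per training sentence, in the same order as the lines of 'train.txt'.
tags = ['bs', 'pt', 'es', 'sr']   # 4 tags for 4 training documents

# Likewise for the test set: one gold tag per line of 'test.txt'.
test_tags = ['es', 'pt']          # 2 tags for 2 test documents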
