
Evaluating a text classification method with the Reuters-21578 dataset

Please do not close this question. I have been looking for an answer for about a month without success, and you are my last hope (if you want to report it, please answer first and then report, thanks). I wrote a hybrid text classification program in MATLAB and it works correctly, but I do not know how to evaluate the results. I do not understand the training set and the test set in Reuters-21578. My code finds the keywords in a text and, with the help of a hybrid KNN algorithm, puts the text in the appropriate class, but I do not know what the candidate classes are. Should I create them myself, or are they already defined? If each .sgm file in Reuters-21578 is a class, how can I use it as a candidate class? The files are full of words, so should I classify them first to arrive at a set of chosen classes against which the other documents can be classified?

The topic tag of each news article can be treated as its class label. You can split the stories that have topics into a training set and a test set to evaluate your classifier. Some stories in Reuters-21578 have no topics at all; you can use your trained classifier to assign class labels to these.

Note: There are many stories with multiple topics.
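The split described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `split_stories` helper, the 25% test fraction, and the sample stories are all made up here, and each story is assumed to be a `(text, topics)` pair already extracted from the .sgm files.

```python
import random

def split_stories(stories, test_fraction=0.25, seed=0):
    """Split (text, topics) pairs into train, test and unlabeled lists.

    Hypothetical helper: stories with an empty topic list are set aside
    as unlabeled; the rest are shuffled and divided into train and test.
    """
    labeled = [story for story in stories if story[1]]
    unlabeled = [story for story in stories if not story[1]]
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(labeled)
    n_test = int(len(labeled) * test_fraction)
    return labeled[n_test:], labeled[:n_test], unlabeled

# Made-up sample data; a story may carry several topics or none.
stories = [
    ("Wheat exports rose this quarter", ["grain", "wheat"]),
    ("Oil prices fell on Monday", ["crude"]),
    ("Company reports higher profit", ["earn"]),
    ("Central bank leaves rates unchanged", ["interest"]),
    ("Story that was never given a topic", []),
]

train, test, unlabeled = split_stories(stories)
print(len(train), len(test), len(unlabeled))
```

Because a story can have several topics, the label of each document is kept as a list rather than a single value.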

I have been through the same problem. If the exact version of the Reuters dataset does not matter to you, the corpus is also available in nltk.corpus, where you can access the training documents, the test documents and their respective categories easily. You do not have to worry about extracting them from the .sgm files.

You can do this:

  from nltk.corpus import reuters
  # Requires the corpus to be downloaded once: nltk.download("reuters")
  # All file ids, e.g. "training/9865" or "test/14826"
  documents = reuters.fileids()
  # Keep only the training and the test documents (as lists, not lazy filters)
  train_docs = [doc for doc in documents if doc.startswith("train")]
  test_docs = [doc for doc in documents if doc.startswith("test")]
  # Raw text of a document
  data = reuters.raw(documents[0])
  # Categories (class labels) of a document; a list, since a story may have several
  category = reuters.categories(documents[0])

Now you can use these to train and test. In a nutshell, train_docs and test_docs hold the file ids of the training and test documents, and the raw text and categories of each document can be retrieved with the calls above.
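As for the evaluation itself, one common choice for multi-label data like this is micro-averaged precision, recall and F1. The sketch below is not part of the original answer: `micro_prf` is a hypothetical helper, and the label lists are made up. In practice the true labels would be the result of `reuters.categories(doc)` for each test document, and the predicted labels would come from your own classifier.

```python
def micro_prf(true_labels, predicted_labels):
    """Micro-averaged precision, recall and F1 for multi-label output.

    Both arguments are sequences of label collections, one per document.
    """
    tp = fp = fn = 0
    for true, pred in zip(true_labels, predicted_labels):
        true, pred = set(true), set(pred)
        tp += len(true & pred)   # labels predicted and correct
        fp += len(pred - true)   # labels predicted but wrong
        fn += len(true - pred)   # correct labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up example: three test documents.
true_labels = [["grain", "wheat"], ["crude"], ["earn"]]
predicted_labels = [["grain"], ["crude", "earn"], ["earn"]]

precision, recall, f1 = micro_prf(true_labels, predicted_labels)
print(precision, recall, f1)
```

Micro-averaging counts every (document, label) decision equally, so frequent categories dominate the score; if rare categories matter to you, a per-category (macro) average may be a better fit.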
