
Using scikit-learn to classify text with a large corpus

I have about 1600 articles in my database, with each article already pre-labeled with one of the following categories:

Technology
Science
Business
World
Health
Entertainment
Sports

I am trying to use scikit-learn to build a classifier that will categorize new articles. (I guess I'll split my labeled data in half, for training and testing?)

I am looking to use tf-idf, as I don't have a list of stop words. (I could use NLTK to extract only adjectives and nouns, but I'd rather give scikit-learn the full article.)

I've read through the scikit-learn documentation, but its examples involve word-occurrence counts and n-grams (which are fine); they never specify how to tie a piece of data to a label.

I've tried looking at their sample code, but it's too confusing to follow.

Could someone help me with this, or point me in the right direction?

Thanks.

I think you faced the same problem I did when I started to feed my own data to the classifiers.

You can use the function sklearn.datasets.load_files, but to do so, you need to create this directory structure:

train
├── science
│   ├── 0001.txt
│   └── 0002.txt
└── technology
    ├── 0001.txt
    └── 0002.txt

Here the subdirectories of train are named after the labels, and each file within a label's directory is an article carrying that label. Then use load_files to load the data:

In [1]: from sklearn.datasets import load_files

In [2]: load_files('train')
Out[2]: 
{'DESCR': None,
 'data': ['iphone apple smartphone\n',
  'linux windows ubuntu\n',
  'biology astrophysics\n',
  'math\n'],
 'filenames': array(['train/technology/0001.txt', 'train/technology/0002.txt',
       'train/science/0002.txt', 'train/science/0001.txt'], 
      dtype='|S25'),
 'target': array([1, 1, 0, 0]),
 'target_names': ['science', 'technology']}

The object returned is a sklearn.datasets.base.Bunch, which is a simple data wrapper.

This is a straightforward way to start playing with the classifiers, but when your data grows larger and changes frequently, you might want to stop using files and instead store the labeled documents in, for example, a database, perhaps with more structure than plain text. Basically, you need to build your list of categories (the target_names), like ['science', 'technology', ...], and set each document's target value to the index of its label in the target_names list. The lengths of data and target must be equal.
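As a minimal sketch of that idea, here is how you might build data and target yourself instead of relying on load_files; the document texts and labels below are invented for illustration:

```python
# Hypothetical example: tying each document to its label manually.
target_names = ['science', 'technology']

# (text, label) pairs, e.g. fetched from a database.
documents = [
    ('iphone apple smartphone', 'technology'),
    ('biology astrophysics', 'science'),
]

data = [text for text, label in documents]
# Each target value is the index of the document's label in target_names.
target = [target_names.index(label) for text, label in documents]

assert len(data) == len(target)
# target is now [1, 0], matching the order of documents.
```

This is exactly the mapping load_files performs for you when it walks the directory tree.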

You can take a look at this script that I wrote a while ago to run a classifier: https://github.com/darkrho/yatiri/blob/master/scripts/run_classifier.py#L267
