I have about 1600 articles in my database, with each article already pre-labeled with one of the following categories:
Technology
Science
Business
World
Health
Entertainment
Sports
I am trying to use scikit-learn to build a classifier that will categorize new articles. (I guess I'll split my data in half, for training and testing?)
I am looking to use tf-idf, as I don't have a list of stop words (I could use NLTK to extract only adjectives and nouns, but I'd rather give scikit-learn the full article).
I've read all of the scikit-learn documentation, but their examples involve word occurrence and N-grams (which are fine), and they never specify how to tie a piece of data to a label.
I've tried looking at their sample code, but it's too confusing to follow.
Could someone help me with this, or point me in the right direction?
Thanks.
I think you've run into the same problem I did when I started feeding my own data to the classifiers.
You can use the function sklearn.datasets.load_files
, but to do so you need to create this directory structure:
train
├── science
│   ├── 0001.txt
│   └── 0002.txt
└── technology
    ├── 0001.txt
    └── 0002.txt
Here the subdirectories of train
are named after the labels, and each file within a label's directory is an article with that label. Then use load_files
to load the data:
In [1]: from sklearn.datasets import load_files

In [2]: load_files('train')
Out[2]:
{'DESCR': None,
 'data': ['iphone apple smartphone\n',
  'linux windows ubuntu\n',
  'biology astrophysics\n',
  'math\n'],
 'filenames': array(['train/technology/0001.txt', 'train/technology/0002.txt',
        'train/science/0002.txt', 'train/science/0001.txt'],
       dtype='|S25'),
 'target': array([1, 1, 0, 0]),
 'target_names': ['science', 'technology']}
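A minimal sketch of going from that directory layout to a trained classifier. To keep the example self-contained it first builds a tiny throwaway copy of the structure above (the documents are the same made-up snippets as in the output shown); the TfidfVectorizer + MultinomialNB pipeline is one reasonable choice, not the only one:

```python
import os
import tempfile

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Recreate the train/<label>/<file>.txt layout in a temp directory
root = tempfile.mkdtemp()
docs = {
    "technology": ["iphone apple smartphone", "linux windows ubuntu"],
    "science": ["biology astrophysics", "math"],
}
for label, texts in docs.items():
    os.makedirs(os.path.join(root, label))
    for i, text in enumerate(texts):
        with open(os.path.join(root, label, "%04d.txt" % i), "w") as f:
            f.write(text)

# bunch.data holds the raw texts, bunch.target the label indices
bunch = load_files(root, encoding="utf-8")

# tf-idf features feeding a Naive Bayes classifier
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(bunch.data, bunch.target)

pred = clf.predict(["ubuntu linux release"])
print(bunch.target_names[pred[0]])  # prints 'technology'
```

Passing encoding="utf-8" makes load_files decode the files into strings; without it you get raw bytes.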
The object returned is a sklearn.datasets.base.Bunch
, which is a simple data wrapper. This is a straightforward way to start playing with the classifiers, but when your data grows larger and changes frequently, you might want to stop using files and instead store the labeled documents in, for example, a database, perhaps with more structure than plain text. Basically you will need to generate your list of categories (or target_names
) like ['science', 'technology', ...]
and assign each document in the data
list a target
value equal to the index of its category in the target_names
list. The lengths of data
and target
must be the same.
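The bookkeeping described above can be sketched like this (the (text, label) rows are made up, standing in for whatever your database query returns):

```python
target_names = ["science", "technology"]

# Pretend these (text, label) pairs came from a database query
rows = [
    ("biology astrophysics", "science"),
    ("math", "science"),
    ("iphone apple smartphone", "technology"),
    ("linux windows ubuntu", "technology"),
]

data = [text for text, _ in rows]
# Each target value is the index of the document's label in target_names
target = [target_names.index(label) for _, label in rows]

assert len(data) == len(target)
print(target)  # [0, 0, 1, 1]
```

From here, data and target can be passed to fit() exactly as the bunch.data and bunch.target produced by load_files.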
You can take a look at this script I wrote some time ago to run a classifier: https://github.com/darkrho/yatiri/blob/master/scripts/run_classifier.py#L267
Maybe start with the example here: http://scikit-learn.org/dev/auto_examples/document_classification_20newsgroups.html#example-document-classification-20newsgroups-py
A somewhat more advanced example is this: http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html#example-grid-search-text-feature-extraction-py
There are quite a few more text examples in the example gallery: http://scikit-learn.org/dev/auto_examples/index.html