How to read and label a text file line by line in Python using nltk.corpus

Question

My problem is to classify documents given two training files, good_reviews.txt and bad_reviews.txt. To start, I need to load and label my training data, where every line is a document in itself and corresponds to one review. So my main task is to classify the reviews (lines) in a given test set.

I found a way to load and label the names data as follows:

from nltk.corpus import names
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

What I want is something similar that labels lines rather than words. I expect the code would look something like this, which of course doesn't work because .lines is not a valid method:

reviews = ([(review, 'good_review') for review in reviews.lines('good_reviews.txt')] +
           [(review, 'bad_review') for review in reviews.lines('bad_reviews.txt')])

and I would like to get a result like this:

>>> reviews[0]
('This shampoo is very good blablabla...', 'good_review')

Answer 1

If you're reading your own text file, NLTK isn't really needed; you can simply use file.readlines():

good_reviews = """This is great!
Wow, it amazes me...
An hour of show, a lifetime of enlightment
"""
bad_reviews = """Comme si, Comme sa.
I just wasted my foo bar on this.
An hour of s**t, ****.
"""
with open('/tmp/good_reviews.txt', 'w') as fout:
    fout.write(good_reviews)
with open('/tmp/bad_reviews.txt', 'w') as fout:
    fout.write(bad_reviews)

# Label every line of each file with its class.
with open('/tmp/good_reviews.txt', 'r') as fingood, open('/tmp/bad_reviews.txt', 'r') as finbad:
    reviews = ([(review, 'good_review') for review in fingood.readlines()] +
               [(review, 'bad_review') for review in finbad.readlines()])

print(reviews)

[out]:

[('This is great!\n', 'good_review'), ('Wow, it amazes me...\n', 'good_review'), ('An hour of show, a lifetime of enlightment\n', 'good_review'), ('Comme si, Comme sa.\n', 'bad_review'), ('I just wasted my foo bar on this.\n', 'bad_review'), ('An hour of s**t, ****.\n', 'bad_review')]
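Note that readlines() keeps the trailing newline on each review, as the output above shows. If that is unwanted, here is a minimal sketch of a helper that strips it and skips blank lines. The function name load_labeled_lines, the UTF-8 encoding, and the temporary file contents are assumptions for illustration, not part of the original answer:

```python
import os
import tempfile


def load_labeled_lines(path, label, encoding="utf-8"):
    """Return a list of (line, label) tuples, one per non-empty line of the file."""
    labeled = []
    with open(path, "r", encoding=encoding) as fin:
        for line in fin:
            text = line.strip()  # drop the trailing newline and surrounding whitespace
            if text:  # skip blank lines
                labeled.append((text, label))
    return labeled


# Usage with two small throwaway files (contents are illustrative).
tmpdir = tempfile.mkdtemp()
good_path = os.path.join(tmpdir, "good_reviews.txt")
bad_path = os.path.join(tmpdir, "bad_reviews.txt")
with open(good_path, "w", encoding="utf-8") as fout:
    fout.write("This is great!\nWow, it amazes me...\n")
with open(bad_path, "w", encoding="utf-8") as fout:
    fout.write("I just wasted my foo bar on this.\n\n")

reviews = (load_labeled_lines(good_path, "good_review")
           + load_labeled_lines(bad_path, "bad_review"))
print(reviews[0])
```

Iterating over the file object directly reads one line at a time, so this should also work for review files too large to hold comfortably in memory with readlines().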

If you're going to use the NLTK movie review corpus, see "Classification using movie review corpus in NLTK/Python".

Answer 1: accepted, 2014-04-27 22:06:43
