
Why do I get several lists when tokenizing in Python?

I am doing a data cleaning task using Python, reading from a text file which contains several sentences. After tokenizing the text file I keep getting a separate list of tokens for each sentence, as follows:

[u'does', u'anyone', u'think', u'that', u'we', u'have', u'forgotten', u'the', u'days', u'of', u'favours', u'for', u'the', u'pn', u's', u'party', u's', u'friends', u'of', u'friends', u'and', u'paymasters', u'some', u'of', u'us', u'have', u'longer', u'memories']

[u'but', u'is', u'the', u'value', u'at', u'which', u'vassallo', u'brothers', u'bought', u'this', u'property', u'actually', u'relevant', u'and', u'represents', u'the', u'actual', u'value', u'of', u'the', u'property']

[u'these', u'monsters', u'are', u'wrecking', u'the', u'reef', u'the', u'cargo', u'vessels', u'have', u'been', u'there', u'for', u'weeks', u'and', u'the', u'passenger', u'ship', u'for', u'at', u'least', u'24', u'hours', u'now', u'https', u'uploads', u'disquscdn', u'com']

The code I am using is the following:

with open(file_path) as fp:
    comments = fp.readlines()

    for i in range(0, len(comments)):
        tokens = tokenizer.tokenize(no_html.lower())
        print tokens

Here no_html is the text of the file with all HTML tags removed. Could anyone tell me how to get all of these tokens into one list, please?
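
For reference, a minimal version of the setup might look like this (assuming NLTK's RegexpTokenizer, a hypothetical comments.txt, and skipping the HTML-stripping step):

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # assumed: keeps word characters only

with open('comments.txt') as fp:     # hypothetical file, one comment per line
    for line in fp:
        tokens = tokenizer.tokenize(line.lower())
        print tokens                 # one separate list is printed per comment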

Instead of using comments = fp.readlines(), try comments = fp.read().

What readlines does is read all the lines of a file and return them in a list, so your loop ends up printing one token list per iteration.
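
A quick sketch of the difference (assuming a hypothetical two-line file):

with open('comments.txt') as fp:
    text = fp.read()        # one string, e.g. 'first comment\nsecond comment\n'

with open('comments.txt') as fp:
    lines = fp.readlines()  # a list, e.g. ['first comment\n', 'second comment\n']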

Another thing you can do is simply join all the tokenized results into a single list:

all_tokens = []
for i in range(0, len(comments)):
    # no_html is the HTML-stripped text being tokenized, as in the question
    tokens = tokenizer.tokenize(no_html.lower())
    all_tokens.extend(tokens)  # extend (not append) keeps one flat list

print all_tokens
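
For completeness, the first suggestion (fp.read()) might look like this; here the tokenizer and the HTML stripping are assumed to work exactly as in the question:

with open(file_path) as fp:
    comments = fp.read()  # the whole file as a single string, not a list of lines

# no_html: comments with the HTML tags stripped, as described in the question
all_tokens = tokenizer.tokenize(no_html.lower())
print all_tokens          # a single list of tokens for the entire file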
