Why do I get several lists when tokenizing in python?
I am doing a data cleaning task using Python, reading from a text file which contains several sentences. After tokenizing the text file I keep getting a separate list of tokens for each sentence, as follows:
[u'does', u'anyone', u'think', u'that', u'we', u'have', u'forgotten', u'the', u'days', u'of', u'favours', u'for', u'the', u'pn', u's', u'party', u's', u'friends', u'of', u'friends', u'and', u'paymasters', u'some', u'of', u'us', u'have', u'longer', u'memories']
[u'but', u'is', u'the', u'value', u'at', u'which', u'vassallo', u'brothers', u'bought', u'this', u'property', u'actually', u'relevant', u'and', u'represents', u'the', u'actual', u'value', u'of', u'the', u'property']
[u'these', u'monsters', u'are', u'wrecking', u'the', u'reef', u'the', u'cargo', u'vessels', u'have', u'been', u'there', u'for', u'weeks', u'and', u'the', u'passenger', u'ship', u'for', u'at', u'least', u'24', u'hours', u'now', u'https', u'uploads', u'disquscdn', u'com'].
The code I am using is the following:
with open(file_path) as fp:
    comments = fp.readlines()

for i in range(0, len(comments)):
    tokens = tokenizer.tokenize(no_html.lower())
    print tokens
Where no_html is the text of the file without any HTML tags. Could anyone tell me how to get all these tokens into one list, please?
Instead of using comments = fp.readlines(), try comments = fp.read() instead.
What readlines does is read all the lines of a file and return them in a list.
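As a quick illustration of the difference, using io.StringIO as a stand-in for an opened file:

```python
import io

# A stand-in for an opened text file containing two lines
fake_file = io.StringIO("first line\nsecond line\n")

print(fake_file.readlines())  # ['first line\n', 'second line\n'] -- a list of lines
fake_file.seek(0)             # rewind to the start of the "file"
print(fake_file.read())       # 'first line\nsecond line\n' -- one single string
```

With read() the whole file arrives as one string, so a single tokenize call yields a single list instead of one list per line.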
Another thing you can do is simply join all the tokenized results into a single list.
all_tokens = []
for i in range(0, len(comments)):
    tokens = tokenizer.tokenize(no_html.lower())
    #print tokens
    all_tokens.extend(tokens)
print all_tokens
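A minimal self-contained sketch of the same flattening idea, with re.findall standing in for tokenizer.tokenize (the question never shows how the tokenizer is constructed) and a hard-coded list standing in for the comments read from the file:

```python
import re

# Stand-in for the lines read from the file: one comment per entry
comments = [
    "Does anyone think that we have forgotten?",
    "These monsters are wrecking the reef.",
]

all_tokens = []
for comment in comments:
    # re.findall stands in for tokenizer.tokenize here:
    # grab lowercase alphanumeric runs as tokens
    tokens = re.findall(r"[a-z0-9]+", comment.lower())
    # extend flattens the per-sentence lists into one list;
    # append would instead produce a nested list of lists
    all_tokens.extend(tokens)

print(all_tokens)
```

The key point is extend versus append: extend adds each token individually, so the final result is one flat list of tokens across all sentences.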