
Why do I get several lists when tokenizing in Python?

I am doing a data cleaning task using Python, reading from a text file which contains several sentences. After tokenizing the text file I keep getting a separate list of tokens for each sentence, as follows:

[u'does', u'anyone', u'think', u'that', u'we', u'have', u'forgotten', u'the', u'days', u'of', u'favours', u'for', u'the', u'pn', u's', u'party', u's', u'friends', u'of', u'friends', u'and', u'paymasters', u'some', u'of', u'us', u'have', u'longer', u'memories']

[u'but', u'is', u'the', u'value', u'at', u'which', u'vassallo', u'brothers', u'bought', u'this', u'property', u'actually', u'relevant', u'and', u'represents', u'the', u'actual', u'value', u'of', u'the', u'property']

[u'these', u'monsters', u'are', u'wrecking', u'the', u'reef', u'the', u'cargo', u'vessels', u'have', u'been', u'there', u'for', u'weeks', u'and', u'the', u'passenger', u'ship', u'for', u'at', u'least', u'24', u'hours', u'now', u'https', u'uploads', u'disquscdn', u'com']

The code I am using is the following:

with open(file_path) as fp:
    comments = fp.readlines()

    for i in range(0, len(comments)):
        tokens = tokenizer.tokenize(no_html.lower())
        print tokens

Here no_html is the text of the file with all HTML tags removed. Could anyone tell me how to get all of these tokens into one list, please?
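
For reference, a minimal version of the setup might look like this (assuming NLTK's RegexpTokenizer, a hypothetical comments.txt, and skipping the HTML-stripping step):

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # assumed: keeps word characters only

with open('comments.txt') as fp:     # hypothetical file, one comment per line
    for line in fp:
        tokens = tokenizer.tokenize(line.lower())
        print tokens                 # one separate list is printed per comment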

Instead of using comments = fp.readlines(), try comments = fp.read().

What readlines does is read all the lines of a file and return them in a list, so your loop ends up printing one token list per iteration.
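
A quick sketch of the difference (assuming a hypothetical two-line file):

with open('comments.txt') as fp:
    text = fp.read()        # one string, e.g. 'first comment\nsecond comment\n'

with open('comments.txt') as fp:
    lines = fp.readlines()  # a list, e.g. ['first comment\n', 'second comment\n']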

Another thing you can do is simply join all the tokenized results into a single list:

all_tokens = []
for i in range(0, len(comments)):
    # no_html is the HTML-stripped text being tokenized, as in the question
    tokens = tokenizer.tokenize(no_html.lower())
    all_tokens.extend(tokens)  # extend (not append) keeps one flat list

print all_tokens
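
For completeness, the first suggestion (fp.read()) might look like this; here the tokenizer and the HTML stripping are assumed to work exactly as in the question:

with open(file_path) as fp:
    comments = fp.read()  # the whole file as a single string, not a list of lines

# no_html: comments with the HTML tags stripped, as described in the question
all_tokens = tokenizer.tokenize(no_html.lower())
print all_tokens          # a single list of tokens for the entire file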
