
How to get unique words from a list quickly?

I have a file with roughly 3 million sentences. Each sentence has around 60 words. I want to combine all the words and find the unique words among them.

I tried the following code:

 import nltk
 from nltk.corpus import stopwords

 final_list = list()
 for sentence in sentence_list:
     words_list = nltk.word_tokenize(sentence)
     words = [word for word in words_list if word not in stopwords.words('english')]
     final_list = list(set(final_list + words))  # keep only the unique words seen so far

This code gives unique words, but it is taking too long to process: around 50k sentences per hour, so it could take about 3 days to finish.
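Two details in the loop above usually dominate the run time: `stopwords.words('english')` is re-read for every sentence and the membership test scans a list, and the accumulator is rebuilt on every pass. A minimal sketch of the same loop with the stopword set cached once and the words collected in a set (assuming `sentence_list` is already defined, as above):

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))   # build the set once; lookups are O(1)
unique_words = set()
for sentence in sentence_list:
    for word in nltk.word_tokenize(sentence):
        if word not in stop_words:
            unique_words.add(word)             # a set keeps only unique words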

I tried with a lambda function too:

    final_list = list(map(lambda x: list(set(nltk.word_tokenize(x))), sentence_list))

But there is no significant improvement in execution time. Please suggest a better solution with a reasonable execution time. Parallel processing suggestions are welcome.

You need to do it all lazily and with as few intermediate lists as possible (reducing allocations and processing time). All unique words from a file:

import itertools

def unique_words_from_file(fpath):
    with open(fpath, "r") as f:
        # iterate lines lazily, split each line into words, flatten, deduplicate
        return set(itertools.chain.from_iterable(map(str.split, f)))

Let's explain the ideas here.

File objects are iterable objects, which means that you can iterate over the lines of a file!

Then we want the words from each line, which means splitting each line. In this case, we use map in Python 3 (or itertools.imap in Python 2) to create an object that performs that computation over our file lines. map and imap are also lazy, which means that no intermediate list is allocated by default, and that is awesome because we will not spend any resources on something we don't need!
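A quick toy illustration of that laziness (not from the original answer): nothing is split until the map object is actually consumed:

lines = ["the quick brown fox", "jumps over the lazy dog"]
lazy = map(str.split, lines)   # no splitting has happened yet
print(lazy)                    # <map object at 0x...>
print(next(lazy))              # ['the', 'quick', 'brown', 'fox'] -- computed on demand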

Since str.split returns a list, our map result would be a succession of lists of strings, but we need to iterate over each of those strings. There is no need to build another list for that: we can use itertools.chain.from_iterable to flatten that result!
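For example, with a couple of hard-coded word lists (a toy illustration):

import itertools

nested = [["to", "be"], ["or", "not", "to", "be"]]
flat = itertools.chain.from_iterable(nested)   # a lazy iterator: "to", "be", "or", ...
print(set(flat))                               # {'to', 'be', 'or', 'not'}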

Finally, we call set, which will iterate over those words and keep just a single copy of each. Voila!

Let's make an improvement! Can we make str.split lazy as well? Yes! Check this SO answer:

import itertools
import re

def split_iter(string):
    # yield word-like tokens one at a time instead of building a full list
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

def unique_words_from_file(fpath):
    with open(fpath, "r") as f:
        return set(itertools.chain.from_iterable(map(split_iter, f)))
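Since the question also welcomes parallel processing suggestions, here is a minimal sketch (not from the original answer) that splits the file into chunks of lines and merges per-chunk word sets with multiprocessing; the function names, chunk size, and worker count are illustrative assumptions:

import itertools
from multiprocessing import Pool

def unique_words_in_lines(lines):
    # same split-and-flatten idea, applied to one chunk of lines
    return set(itertools.chain.from_iterable(map(str.split, lines)))

def unique_words_parallel(fpath, chunk_size=100_000, workers=4):
    with open(fpath, "r") as f, Pool(workers) as pool:
        # read the file in chunks of `chunk_size` lines until an empty chunk is returned
        chunks = iter(lambda: list(itertools.islice(f, chunk_size)), [])
        # each worker returns a set of words for its chunk; union merges them
        return set().union(*pool.imap_unordered(unique_words_in_lines, chunks))

Whether this beats the single-process version depends on how much time goes into splitting versus pickling the chunks; on Windows and macOS the call must be made under an if __name__ == "__main__": guard.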
