
How to remove strings containing certain words from list FASTER

There is a list of sentences sentences = ['Ask the swordsmith', 'He knows everything'] . The goal is to remove the sentences that contain a word from a wordlist lexicon = ['word', 'every', 'thing'] . This can be achieved using the following list comprehension:

newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]

Note that if not word in sentence is not a sufficient condition, as it would also remove sentences containing words in which a word from the lexicon is embedded, e.g. word is embedded in swordsmith , and every and thing are embedded in everything .
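To see the difference, here is a quick illustration with the example data above (expected output shown in comments):

sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = ['word', 'every', 'thing']

# Substring test: wrongly flags both sentences, because 'word' is a
# substring of 'swordsmith' and 'every'/'thing' are substrings of 'everything'.
print([s for s in sentences if any(w in s for w in lexicon)])
# ['Ask the swordsmith', 'He knows everything']

# Whole-word test: flags neither sentence, since no whole word matches.
print([s for s in sentences if any(w in s.split(' ') for w in lexicon)])
# []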

However, my list of sentences consists of 1,000,000 sentences, and my lexicon of 200,000 words. Applying the list comprehension above takes hours! Because of that, I'm looking for a faster method to remove strings from a list that contain words from another list. Any suggestions? Maybe using regex?

Do your lookup in a set . This makes it fast, and alleviates the containment issue because you only look for whole words in the lexicon.

lexicon = set(lexicon)
newlist = [s for s in sentences if not any(w in lexicon for w in s.split())]

This is pretty efficient because w in lexicon is an O(1) operation, and any short-circuits. The main issue is splitting your sentence into words properly. A regular expression is inevitably going to be slower than a customized solution, but may be the best choice, depending on how robust you want to be against punctuation and the like. For example:

import re

lexicon = set(lexicon)
pattern = re.compile(r'\w+')
newlist = [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]
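To see what the regex buys you, consider a sentence with trailing punctuation (a made-up example, not from the question):

import re

lexicon = {'thing'}
s = 'One more thing.'

# split(' ') keeps the trailing period attached, so the lookup misses.
print(any(w in lexicon for w in s.split(' ')))  # False: 'thing.' != 'thing'

# \w+ matches runs of word characters only, so punctuation is stripped.
print(any(m.group() in lexicon for m in re.finditer(r'\w+', s)))  # True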

You can optimize three things here:

1. Convert lexicon to a set so that the in check becomes cheap (O(1) on average).

lexicon = set(lexicon)

2. Check the intersection of each sentence with lexicon in the most efficient way, using set operations. The performance of set intersection was discussed here.

[x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]
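isdisjoint returns True exactly when the two sets share no element, so it directly encodes "keep this sentence". A quick check (the second sentence is a made-up example):

lexicon = {'word', 'every', 'thing'}

print({'ask', 'the', 'swordsmith'}.isdisjoint(lexicon))  # True  -> keep
print({'one', 'more', 'thing'}.isdisjoint(lexicon))      # False -> remove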

3. Use filter instead of a list comprehension.

list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

Final code:

lexicon = set(lexicon)
list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
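As a sanity check, applied to the example data from the question (plus one extra made-up sentence with an exact whole-word hit), only that extra sentence is removed:

sentences = ['Ask the swordsmith', 'He knows everything', 'Say the word']
lexicon = {'word', 'every', 'thing'}

kept = list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
print(kept)  # ['Ask the swordsmith', 'He knows everything']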

Results

import re

def removal_0(sentences, lexicon):
    lexicon = set(lexicon)
    pattern = re.compile(r'\w+')
    return [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]

def removal_1(sentences, lexicon):
    lexicon = set(lexicon)
    return [x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

def removal_2(sentences, lexicon):
    lexicon = set(lexicon)
    return list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

%timeit removal_0(sentences, lexicon)
%timeit removal_1(sentences, lexicon)
%timeit removal_2(sentences, lexicon)

9.88 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.19 µs ± 55.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.76 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note: it seems filter is a little bit slower, but I don't know the reason yet.
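The timings above were clearly taken on a small sample, not on the full 1,000,000-sentence corpus. A minimal, self-contained way to reproduce the comparison without %timeit, assuming toy data (the original benchmark input is not shown):

import timeit

# Hypothetical toy data standing in for the real corpus.
sentences = ['Ask the swordsmith', 'He knows everything'] * 100
lexicon = ['word', 'every', 'thing'] * 100

for fn in (removal_0, removal_1, removal_2):
    per_call = timeit.timeit(lambda: fn(sentences, lexicon), number=1000) / 1000
    print(f'{fn.__name__}: {per_call * 1e6:.1f} µs per call')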
