How to remove strings containing certain words from a list FASTER
There is a list of sentences:
sentences = ['Ask the swordsmith', 'He knows everything']
The goal is to remove the sentences that contain a word from a wordlist:
lexicon = ['word', 'every', 'thing']
This can be achieved using the following list comprehension:
newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]
Note that
if not word in sentence
is not a sufficient condition, as it would also remove sentences containing words in which a word from the lexicon is embedded: e.g. word is embedded in swordsmith, and every and thing are embedded in everything.
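The difference between the substring check and the whole-word check can be seen directly on the question's sample data; both sentences survive the token-based comprehension, while the naive substring test wrongly removes both:

```python
sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = ['word', 'every', 'thing']

# Whole-word check: compare lexicon words against the sentence's tokens.
# Neither sentence contains a lexicon word as a standalone token, so both are kept.
newlist = [s for s in sentences
           if not any(w in s.split(' ') for w in lexicon)]

# Substring check: 'word' occurs inside 'swordsmith', and 'every'/'thing'
# occur inside 'everything', so both sentences are (wrongly) removed.
wrong = [s for s in sentences
         if not any(w in s for w in lexicon)]
```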
However, my list consists of 1,000,000 sentences and my lexicon of 200,000 words. Applying the list comprehension above takes hours! Because of that, I'm looking for a faster method to remove strings from one list when they contain words from another list. Any suggestions? Maybe using regex?
Do your lookup in a set. This makes it fast, and alleviates the containment issue because you only look for whole words in the lexicon.
lexicon = set(lexicon)
newlist = [s for s in sentences if not any(w in lexicon for w in s.split())]
This is pretty efficient because
w in lexicon
is an O(1) operation, and
any
short-circuits. The main issue is splitting your sentence into words properly. A regular expression is inevitably going to be slower than a customized solution, but may be the best choice, depending on how robust you want to be against punctuation and the like. For example:
lexicon = set(lexicon)
pattern = re.compile(r'\w+')
newlist = [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]
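A small illustration of why the regex tokenizer is more robust than str.split() against punctuation (the sentence 'One more thing.' is a hypothetical example, not from the question's data):

```python
import re

lexicon = {'thing'}
pattern = re.compile(r'\w+')
s = 'One more thing.'

# str.split() keeps the trailing period attached ('thing.'),
# so the whole-word lookup misses:
split_hit = any(w in lexicon for w in s.split())

# finditer() extracts runs of word characters only, so 'thing' is found:
regex_hit = any(m.group() in lexicon for m in pattern.finditer(s))
```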
You can optimize three things here:
1. Convert lexicon to a
set
so that the
in
operation is cheap (O(1)).
lexicon = set(lexicon)
2. Check the intersection of each
sentence
with the
lexicon
in the most efficient way. This should use
set
operations. The performance of set intersection was discussed here.
[x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]
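Applied to the question's sample data (plus a hypothetical third sentence, 'Say the word', added here just so one removal is visible):

```python
sentences = ['Ask the swordsmith', 'He knows everything', 'Say the word']
lexicon = {'word', 'every', 'thing'}

# A sentence is kept only if its token set shares no element with the lexicon;
# isdisjoint() returns False as soon as a common element is found,
# without building the intersection set.
kept = [x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]
```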
3. Use
filter
instead of a list comprehension.
list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
Final code:
lexicon = set(lexicon)
list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
Results
def removal_0(sentences, lexicon):
lexicon = set(lexicon)
pattern = re.compile(r'\w+')
return [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]
def removal_1(sentences, lexicon):
lexicon = set(lexicon)
return [x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]
def removal_2(sentences, lexicon):
lexicon = set(lexicon)
return list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
%timeit removal_0(sentences, lexicon)
%timeit removal_1(sentences, lexicon)
%timeit removal_2(sentences, lexicon)
9.88 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.19 µs ± 55.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.76 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note: it seems filter is a little bit slower, but I don't know the reason yet.