Performance - Fastest way to compare 2 large lists of strings in Python

I have two Python lists: one contains about 13000 disallowed phrases, and the other contains about 10000 sentences.

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

I need to check every sentence in the sentences list to see if it contains any phrase from the phrases list; if it does, I want to put ** around the phrase and add the sentence to another list. I also need to do this in the fastest possible way.

This is what I have so far:

import re

newlist = []
for sentence in sentences:
    for phrase in phrases:
        if phrase in sentence.lower():
            # Compile a case-insensitive pattern per match and wrap it in **
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**" + phrase + "**", sentence)
            newlist.append(newsentence)

So far this approach takes about 60 seconds to complete.

I tried using multiprocessing (each sentence's for loop was mapped separately), but this yielded even slower results. Given that each process ran at about 6% CPU usage, the overhead appears to make mapping such a small task to multiple cores not worth it. I thought about separating the sentences list into smaller chunks and mapping those to separate processes, but haven't quite figured out how to implement this; a sketch of that idea follows.
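For reference, here is a minimal sketch of that chunking idea (illustrative only: process_chunk and the chunk count are assumptions, and phrases/sentences must be defined at module level so worker processes can see them):

import re
from multiprocessing import Pool

def process_chunk(chunk):
    # Run the original inner loop over one slice of the sentences list
    out = []
    for sentence in chunk:
        for phrase in phrases:
            if phrase in sentence.lower():
                iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
                out.append(iphrase.sub("**" + phrase + "**", sentence))
    return out

if __name__ == "__main__":
    n = 4  # number of worker processes; tune to the machine
    size = len(sentences) // n + 1
    chunks = [sentences[i:i + size] for i in range(0, len(sentences), size)]
    with Pool(n) as pool:
        # Each worker handles one chunk; flatten the per-chunk results
        newlist = [s for chunk_result in pool.map(process_chunk, chunks) for s in chunk_result]

Whether this wins depends on whether the per-process startup and pickling overhead is smaller than the work saved per chunk.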

I've also considered using a binary search algorithm but haven't been able to figure out how to use this with strings.

So essentially, what would be the fastest possible way to perform this check?

Build your regex once, sorting by longest phrase first so the ** s wrap the longest matching phrases rather than the shortest; perform the substitution, then filter out the sentences where no substitution was made, eg:

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    "can be really really",
    "characters",
    "some sentences"
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

import re

# Build the regex alternation once, longest phrases first so they win over
# any shorter phrases they contain
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))
# Generator yielding each sentence with matched phrases wrapped in **
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)
# Keep only sentences where a substitution was actually made
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]

This gives you a results of:

['**some sentences** are longer',
 '**some sentences** **can be really really** ... really long, about 1000 **characters**.']
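If the case-insensitive matching from the original attempt is also needed, one option (a sketch on top of the answer above, not part of it) is to compile the alternation once with re.IGNORECASE; the \1 backreference keeps the sentence's original casing rather than the casing in the phrases list:

import re

# Compile once, case-insensitively; longest phrases still sort first
rx = re.compile('({})'.format('|'.join(
    re.escape(el) for el in sorted(phrases, key=len, reverse=True))), re.IGNORECASE)
it = (rx.sub(r'**\1**', sentence) for sentence in sentences)
results = [new for old, new in zip(sentences, it) if old != new]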

What about a set comprehension?

found = {'**' + p + '**' for s in sentences for p in phrases if p in s}
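Note that this collects the matched phrases themselves rather than rewritten sentences; with the sample data above it evaluates to {'**some sentences**', '**can be really really**', '**characters**'}.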

You could try updating (by reduction) the phrases list, if you don't mind altering it:

found = []
p = phrases[:]  # shallow copy for modification
for s in sentences:
    for i in range(len(phrases)):
        phrase = phrases[i]
        if phrase in s:
            # Phrase seen at least once; stop checking it in later sentences
            p.remove(phrase)
            found.append('**' + phrase + '**')
    phrases = p[:]

Basically each iteration reduces the phrases container. We iterate through the latest container until we find a phrase that appears in at least one sentence.

We remove it from the copied list, and once we've checked the latest phrases, we update the container with the reduced subset of phrases (those that haven't been seen yet). We do this because we only need to see a phrase at least once, so checking it again (although it may exist in another sentence) is unnecessary.
