Performance: the fastest way to compare 2 large lists of strings in Python

Question

I have to work with two Python lists: one contains about 13,000 disallowed phrases, and the other contains about 10,000 sentences.

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

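I need to check every sentence in the sentences list to see whether it contains any phrase from the phrases list. If it does, I want to wrap the phrase in ** and add the sentence to another list. I also need to do this as fast as possible.

This is what I have so far: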

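import re

newlist = []  # sentences with a disallowed phrase marked
for sentence in sentences:
    for phrase in phrases:
        # cheap substring test before building a regex for this phrase
        if phrase in sentence.lower():
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**"+phrase+"**", sentence)
            newlist.append(newsentence)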

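So far, this approach takes about 60 seconds to complete.

I tried multiprocessing (mapping each sentence's for loop separately), but that turned out to be even slower. Given that each process was running at about 6% CPU usage, the overhead seems to make it not worthwhile to map such a small task to multiple cores. I considered splitting the sentences list into smaller chunks and mapping those to separate processes, but I haven't figured out how to implement that.

I also considered using a binary search algorithm, but I haven't figured out how to apply it to strings.

So, essentially, what is the fastest way to perform this check?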

Answer 1

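Build your regular expression once, sorting by longest phrase first so that the ** go around the longest matching phrase rather than the shortest, perform the substitution, and filter out the sentences in which nothing was replaced, for example: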

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    'can be really really',
    'characters',
    'some sentences'
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

import re

# Build the regex once: an alternation of all escaped phrases, longest first
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))
# Generator to yield replaced sentences
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)
# Pair each new sentence with its original and keep only the ones that changed
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]

This gives you the following results:

['**some sentences** are longer',
 '**some sentences** **can be really really** ... really long, about 1000 **characters**.']
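
The loop in the question matched phrases case-insensitively, which the regex above does not. Below is a minimal sketch of a case-insensitive variant, assuming you want to keep each sentence's original casing inside the ** markers (the question's code inserted the lowercase phrase instead). The helper name mark_phrases is hypothetical and not part of the answer above.

```python
import re


def mark_phrases(sentences, phrases):
    """Return the sentences that contain a phrase, with each match wrapped in **."""
    # Drop empty phrases: an empty alternative would match at every position.
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    if not ordered:
        return []
    # Compile once. Alternatives are tried left to right at each position,
    # so putting longer phrases first makes the longest phrase win.
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)

    marked = []
    for sentence in sentences:
        # subn returns (new_string, number_of_substitutions)
        new_sentence, count = pattern.subn(
            lambda m: "**" + m.group(0) + "**", sentence
        )
        if count:
            marked.append(new_sentence)
    return marked


example = mark_phrases(
    ["sentence", "Some Sentences are longer", "it is really really long"],
    ["some sentences", "really", "really really"],
)
```

Using subn avoids comparing each new sentence with its original, because the substitution count already tells you whether anything matched. With roughly 13,000 alternatives the pattern is large, so it is worth timing this against your real data before relying on it.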

Answer 2

How about a set comprehension?

found = {'**' + p + '**' for s in sentences for p in phrases if p in s}

If you don't mind changing the phrases list, you can try updating (shrinking) it as you go:

found = []
p = phrases[:] # shallow copy for modification
for s in sentences:
    for i in range(len(phrases)):
        phrase = phrases[i]
        if phrase in s:
            p.remove(phrase)
            found.append('**'+ phrase + '**')
    phrases = p[:]

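Basically, every iteration shrinks the phrases container. We iterate over the latest container until we find a phrase that occurs in at least one sentence.

We remove that phrase from the copied list, and once the latest phrases have been checked we update the container with the reduced subset of phrases (the ones not seen yet). We do this because we only need to see a phrase at least once, so there is no need to check it again (even though it may also occur in another sentence).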
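
The shrinking idea described above can also be sketched without copying the list on every sentence. This is only an illustration, assuming the goal is the same as in this answer (collecting each phrase that occurs at least once, not the marked sentences); the function name find_present_phrases is hypothetical.

```python
def find_present_phrases(sentences, phrases):
    """Return '**phrase**' for every phrase that occurs in at least one sentence."""
    # A dict keeps insertion order and allows O(1) deletion,
    # unlike list.remove, which scans the list each time.
    remaining = dict.fromkeys(phrases)
    found = []
    for s in sentences:
        # Collect hits first so the dict is not modified while iterating over it.
        hits = [p for p in remaining if p in s]
        for p in hits:
            del remaining[p]
            found.append("**" + p + "**")
        if not remaining:
            break  # every phrase has been seen once; nothing left to check
    return found


present = find_present_phrases(["a b c", "b d"], ["b", "d", "x"])
```

Each phrase is tested against later sentences only until its first hit, so the work per sentence drops as more phrases are found.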

Answer 1
3 Accepted 2018-05-11 05:51:44

Answer 2
0 2018-05-11 04:53:04
