Performance: fastest way to compare 2 large lists of strings in Python

Question

I have two Python lists: one contains about 13,000 disallowed phrases, and the other contains about 10,000 sentences.

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

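I need to check every sentence in the sentences list to see whether it contains any phrase from the phrases list. If it does, I want to put ** around the phrase and add the sentence to another list. I also need to do this as fast as possible.

This is what I have so far: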

import re

newlist = []  # collects every sentence in which a phrase was marked
for sentence in sentences:
    for phrase in phrases:
        if phrase in sentence.lower():
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**"+phrase+"**", sentence)
            newlist.append(newsentence)

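So far, this method takes about 60 seconds to complete.

I tried using multiprocessing (mapping each sentence's for loop separately), but that turned out even slower. Given that each process was running at about 6% CPU usage, it looks like the overhead makes it not worth mapping such a small task to multiple cores. I have thought about splitting the sentences list into smaller chunks and mapping those to separate processes, but I haven't figured out how to implement this.

I have also considered using a binary search algorithm, but I haven't figured out how to use it with strings.

So, essentially, what is the fastest way to perform this check?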
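The chunking idea mentioned above could be sketched roughly as follows. This is only an illustration, not something from the original post: the helper names `chunked`, `mark_chunk` and `mark_parallel`, and the defaults `processes=4` and `chunk_size=500`, are assumptions. Each worker runs the same nested loop as the code in the question, but over a whole chunk of sentences, so the per-task overhead is paid once per chunk rather than once per sentence.

```python
import re
from multiprocessing import Pool


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def mark_chunk(args):
    """Worker: run the question's nested loop over one chunk of sentences."""
    sentence_chunk, phrases = args
    marked = []
    for sentence in sentence_chunk:
        lowered = sentence.lower()  # lowercase once per sentence, not once per phrase
        for phrase in phrases:
            if phrase in lowered:
                pattern = re.compile(re.escape(phrase), re.IGNORECASE)
                marked.append(pattern.sub("**" + phrase + "**", sentence))
    return marked


def mark_parallel(sentences, phrases, processes=4, chunk_size=500):
    """Map chunks of sentences to worker processes and flatten the results."""
    tasks = [(chunk, phrases) for chunk in chunked(sentences, chunk_size)]
    with Pool(processes) as pool:
        per_chunk = pool.map(mark_chunk, tasks)
    return [s for chunk_result in per_chunk for s in chunk_result]
```

On platforms that start workers by spawning a fresh interpreter, `mark_parallel(sentences, phrases)` should be called from under an `if __name__ == "__main__":` guard. Whether this beats the single-process loop depends on the data and the machine, so it is worth timing both.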

Answer 1

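Build your regex once, sorted by longest phrase first so that the ** go around the longest matching phrase rather than the shortest, perform the substitution, and filter out the sentences in which nothing was substituted, e.g.: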

import re

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    'can be really really',
    'characters',
    'some sentences'
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

# Build the regex string required
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))
# Generator to yield replaced sentences
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)
# Build list of paired new sentences and old to filter out where not the same
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]

This gives you the following results:

['**some sentences** are longer',
 '**some sentences** **can be really really** ... really long, about 1000 **characters**.']
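The code in the question matches case-insensitively, while the regex above is case-sensitive. A minimal variant, offered as a sketch rather than as part of the original answer, compiles the pattern once with `re.IGNORECASE` and uses `subn` to learn whether anything was replaced instead of comparing old and new strings. The function name `mark_phrases` and the sample data are illustrative assumptions. Unlike the question's loop, it keeps the sentence's own casing inside the markers.

```python
import re


def mark_phrases(phrases, sentences):
    """Return the sentences containing any phrase, with each match wrapped in **."""
    if not phrases:
        return []  # an empty alternation would match everywhere
    # Longest phrases first so the alternation prefers the longest match.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)

    results = []
    for sentence in sentences:
        # subn returns (new_string, number_of_substitutions_made);
        # \g<0> stands for the whole matched text.
        new_sentence, count = pattern.subn(r"**\g<0>**", sentence)
        if count:
            results.append(new_sentence)
    return results


phrases = ["can be really really", "characters", "some sentences"]
sentences = [
    "sentence",
    "Some sentences are longer",
    "some sentences can be REALLY really ... really long, about 1000 characters.",
]
marked = mark_phrases(phrases, sentences)
# marked == ['**Some sentences** are longer',
#            '**some sentences** **can be REALLY really** ... really long, about 1000 **characters**.']
```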

Answer 2

How about a set comprehension?

found = {'**' + p + '**' for s in sentences for p in phrases if p in s}

If you don't mind altering the phrases list, you can try updating (reducing) it as you go:

found = []
p = phrases[:] # shallow copy for modification
for s in sentences:
    for i in range(len(phrases)):
        phrase = phrases[i]
        if phrase in s:
            p.remove(phrase)
            found.append('**'+ phrase + '**')
    phrases = p[:]

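Basically, each iteration reduces the phrases container. For every sentence, we loop through the latest container and pick out each phrase that occurs in that sentence.

We remove each such phrase from the copied list, and once all the current phrases have been checked we update the container with the reduced subset (the phrases that have not been seen yet). We do this because we only need to see a phrase once, so there is no need to check it again, even though it may also occur in another sentence.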
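The reduction described above can also be written without `list.remove`, which scans the list on every call. The following is a sketch under that assumption; the function name `phrases_present` is made up for illustration. Like the code in this answer, it returns the marked phrases that occur somewhere, not the marked sentences, and it leaves the caller's list untouched.

```python
def phrases_present(phrases, sentences):
    """Return each phrase found in at least one sentence, wrapped in **."""
    remaining = list(phrases)  # copy, so the caller's list is not modified
    found = []
    for sentence in sentences:
        if not remaining:
            break  # every phrase has already been seen
        still_missing = []
        for phrase in remaining:
            if phrase in sentence:
                found.append("**" + phrase + "**")
            else:
                still_missing.append(phrase)
        # Only phrases not yet seen are checked against later sentences.
        remaining = still_missing
    return found
```

For example, `phrases_present(["phrase1", "some sentences"], ["some sentences are longer"])` returns `['**some sentences**']`.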

Answer 1: score 3, accepted, 2018-05-11 05:51:44

Answer 2: score 0, 2018-05-11 04:53:04
