Python regex performance: best way to iterate over texts with thousands of regexes

Question

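I have done a lot of research but have not found anything that really helped. Maybe my approach is just odd, and someone can point my thinking in the right direction.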

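Here is the situation:

I need to process a large number of texts (hundreds of thousands). In these texts I need to find and process certain strings:

Certain "static" substrings (such as case numbers) that I pull from a database (also hundreds of thousands)
Strings that I match with a regex that is built dynamically to cover every possible occurrence; the last part of the regex is set dynamically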

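Obviously this leads to a huge number of iterations, since every text has to be fed into a function that runs hundreds of thousands of regexes, and that results in very long runtimes.

Is there a better and faster way to accomplish this? The current approach works, but it is very slow and puts a heavy load on the server for weeks.

Some sample code to illustrate my idea: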

import re

cases = []          # 100 000 case numbers from db
suffixes = []       #  500 different suffixes to try from db

texts = []          # 100 000 for the beginning - will become less after initial run

def process_item(text: str) -> str:
    for s in suffixes:
        pattern = '(...)(.*?)(%s|...)' % s
        x = re.findall(pattern, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which suffix matched
            pass
    for c in cases:
        escaped = re.escape(c)
        x = re.findall(escaped, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which number matched
            pass

    return text


for text in texts:
    processed = process_item(text)

Any ideas are highly appreciated!

Answer 1

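I can't comment, so here are just a few thoughts:

From what you posted, the things you search for are always the same, so why not join them into one big regex and compile that big regex before running the loop?

That way you compile the regex only once instead of on every iteration.

For example: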

import re

cases = []          # 100 000 case numbers from db
suffixes = []       #  500 different suffixes to try from db

texts = []          # 100 000 for the beginning - will become less after initial run

# Compile each big alternation once, before the loop.
# Suffixes are joined as-is, assuming they are already valid regex fragments.
bre1 = re.compile('|'.join(suffixes), re.IGNORECASE)
# Case numbers are literal strings, so escape them.
bre2 = re.compile('|'.join(re.escape(c) for c in cases), re.IGNORECASE)

def process_item(text: str) -> str:
    for match in bre1.findall(text):
        # process the matches, where I need to know which suffix matched
        pass

    for match in bre2.findall(text):
        # process the matches, where I need to know which number matched
        pass

    return text


for text in texts:
    processed = process_item(text)
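
The question also needs to know which suffix or number matched. A minimal sketch of one way to recover that from a single combined pattern follows; the sample suffixes, the helper name find_suffixes and the lookup dict by_lower are made up for illustration, and the entries are assumed to be literal strings (hence re.escape). Sorting the alternatives longest first matters because alternation takes the first alternative that matches, not the longest one.

```python
import re

# Hypothetical sample data; the real list would come from the database.
suffixes = ["GmbH", "AG", "GmbH & Co. KG"]

# Map the lower-cased form back to the original entry, so that a
# case-insensitive hit can be traced to the suffix that produced it.
# Assumes str.lower() agrees with the regex case folding (true for ASCII).
by_lower = {s.lower(): s for s in suffixes}

# Longest first: otherwise "GmbH" would shadow "GmbH & Co. KG".
ordered = sorted(suffixes, key=len, reverse=True)
combined = re.compile("|".join(re.escape(s) for s in ordered), re.IGNORECASE)


def find_suffixes(text):
    """Return (original suffix, start offset) for every hit in text."""
    return [(by_lower[m.group().lower()], m.start())
            for m in combined.finditer(text)]
```

Each text is then scanned once per combined pattern instead of once per suffix. Whether a single alternation with hundreds of thousands of case numbers compiles and runs fast enough would have to be measured on the real data.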

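If you can reliably locate the case number in the text (for example, because it is preceded by some identifier), it is better to find the case number with re.search, put the known case numbers in a set, and test for membership in that set.

For example: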

import re

cases = ["123", "234"]
cases_set = set(cases)      # set lookup is O(1) on average

texts = ["id:123", "id:548"]

# Three digits immediately preceded by the "id:" marker.
sre = re.compile(r'(?<=id:)\d{3}')
for t in texts:
    m = sre.search(t)
    if m and m.group() in cases_set:
        # do stuff ....
        pass
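
If a text can contain several case numbers, the same idea extends to scanning for every candidate token and keeping only those present in the set. The sketch below assumes, purely for illustration, that case numbers are plain runs of digits; token_re, find_cases and the sample data are hypothetical, and the token pattern would need to be adapted to the real case-number format.

```python
import re

# Hypothetical data: assume case numbers are plain runs of digits.
cases = {"123", "234", "99812"}

# One generic pattern that matches anything shaped like a case number.
token_re = re.compile(r"\d+")


def find_cases(text):
    """Return (case number, start offset) for every known case in text."""
    return [(m.group(), m.start())
            for m in token_re.finditer(text)
            if m.group() in cases]
```

With this approach the work per text depends on the number of candidate tokens in the text, not on the number of case numbers in the database.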

Solution 1
2 Accepted 2019-10-21 14:09:26
