Replacing multiple strings in a large text file in Python

Question

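Problem:

Replacing multiple string patterns in a large text file takes a lot of time. (Python)

Scenario:

I have a large text file with no particular structure. However, it contains several kinds of patterns, for example email addresses and phone numbers.

There are more than 100 different patterns, and the file is 10 MB (the size may grow). The file may or may not contain all 100 patterns.

Currently I replace the matches with re.sub(). The replacement code looks like this.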

import gzip
import re

linestr = ''  # accumulator for the non-empty lines
readfile = gzip.open(path, 'rt')  # read the gzipped file in text mode
lines = readfile.readlines()  # load all lines into memory

for line in lines:
    if len(line.strip()) != 0:  # skip the empty lines
        linestr += line

for pattern in patterns:  # patterns holds (regex, replacement) pairs
    regex = pattern[0]
    replace = pattern[1]
    compiled_regex = compile_regex(regex)  # helper not shown in the question
    linestr = re.sub(compiled_regex, replace, linestr)

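For large files this approach takes a lot of time. Is there a better way to optimize it?

I am thinking of replacing += with .join(), but I am not sure how much that would help.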

Answer 1

You can use line_profiler to find out which lines of your code take the most time:

pip install line_profiler    
kernprof -l run.py
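
As a rough illustration of what run.py might contain: kernprof -l only reports on functions decorated with @profile, a name it injects into builtins while it runs. The function name apply_patterns below is hypothetical, and the fallback decorator is only there so the script still runs without kernprof.

```python
import re

try:
    profile  # defined by `kernprof -l` at run time
except NameError:
    # Fallback so the script also runs without kernprof: a no-op decorator.
    def profile(func):
        return func


@profile
def apply_patterns(text, patterns):
    """Apply each (regex, replacement) pair to the whole text in turn."""
    for regex, replacement in patterns:
        text = re.sub(regex, replacement, text)
    return text
```

After the run, the per-line timings can be printed with python -m line_profiler run.py.lprof.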

Another thing: I think the string you are building is too large to hold in memory. Maybe you could make use of generators.
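
A minimal sketch of the generator idea, assuming that no pattern needs to match across a line break (the question does not say whether that holds). The function names and the UTF-8 encoding are assumptions for illustration only.

```python
import gzip
import re


def compile_patterns(patterns):
    """Compile each (regex, replacement) pair once, up front."""
    return [(re.compile(regex), replacement) for regex, replacement in patterns]


def replaced_lines(lines, compiled):
    """Lazily yield each non-blank line with every pattern applied."""
    for line in lines:
        if not line.strip():
            continue  # drop blank lines, as the original code does
        for regex, replacement in compiled:
            line = regex.sub(replacement, line)
        yield line


def rewrite_gzip(src_path, dst_path, patterns):
    """Stream src_path to dst_path without holding the whole file in memory."""
    compiled = compile_patterns(patterns)
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(replaced_lines(src, compiled))
```

Only one line is held at a time, so memory use stays flat as the file grows. Whether it is also faster depends on the patterns, which is what the profiler run would show.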

Answer 2

You can get somewhat better results with the following:

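import gzip
import re

large_list = []

# 'rt' opens the gzipped file in text mode so lines are str, not bytes
with gzip.open(path, 'rt') as fp:
    for line in fp:  # iterate lazily instead of calling readlines()
        if line.strip():  # keep only non-empty lines
            large_list.append(line)

# one join instead of repeated += on a growing string
merged_lines = ''.join(large_list)

for regex, replace in patterns:
    # compile_regex is the asker's own helper (not shown);
    # re.compile is the standard-library equivalent
    compiled_regex = compile_regex(regex)
    merged_lines = re.sub(compiled_regex, replace, merged_lines)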

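However, further optimization is possible only with knowledge of what kind of processing you apply. In practice, the last line (the re.sub call) is what consumes all the CPU time (and memory allocation). If the regexes can be applied line by line, you can get great results with the multiprocessing package. Because of the GIL ( https://wiki.python.org/moin/GlobalInterpreterLock ), threads will not give you any benefit.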
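
One way the multiprocessing suggestion could look, again assuming the patterns can be applied line by line. The two regexes in PATTERNS are made-up stand-ins for the real list, and the names substitute_chunk, chunked and replace_parallel, as well as the chunk size, are illustrative choices rather than anything from the question.

```python
import re
from itertools import islice
from multiprocessing import Pool

# Hypothetical (compiled regex, replacement) pairs standing in for `patterns`.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "<PHONE>"),
]


def substitute_chunk(lines):
    """Apply every pattern to each line of one chunk (runs in a worker)."""
    out = []
    for line in lines:
        for regex, replacement in PATTERNS:
            line = regex.sub(replacement, line)
        out.append(line)
    return out


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def replace_parallel(lines, processes=None, chunk_size=10_000):
    """Spread chunks of lines over a process pool, keeping the input order."""
    with Pool(processes) as pool:
        results = pool.imap(substitute_chunk, chunked(lines, chunk_size))
        return [line for chunk in results for line in chunk]
```

Sending chunks rather than single lines keeps the inter-process overhead small. On platforms that start workers by spawning a fresh interpreter, replace_parallel should be called from under an if __name__ == "__main__": guard.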

2 answers

Answer 1
2 2016-12-16 22:13:56

Answer 2
1 2016-12-16 22:27:28
