Python: comparing a regex pattern from one text file against the lines of another text file

Question

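This works for smaller text files but not for larger ones (100,000 lines). How can I optimize it for large text files? For each line in fileA, if regexPattern == a line in fileB, write the (entire) line from fileA to fileC.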

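import re

with open('fileC.txt', 'w') as outfile:
    with open('fileA.txt', 'r') as infile1:
        for line1 in infile1:
            # Capture the third comma-separated field (e.g. a file name).
            y = re.findall(r'^.+,.+,(.+\.[a-z]+$)', line1)
            if not y:
                continue  # no match on this line; y[0] would raise IndexError
            # fileB.txt is reopened and rescanned for every line of fileA.txt.
            with open('fileB.txt', 'r') as infile2:
                for line2 in infile2:
                    if line2.strip() == y[0]:
                        outfile.write(line1)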

Answer 1

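The most immediate optimization is to read fileB.txt into a string buffer only once and then apply the match test against that buffer. You are currently opening and reading that file once for every line of fileA.txt.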
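One way to sketch the "read fileB.txt once" idea is to load its stripped lines into a set and test each captured field for membership. This is only a minimal sketch, not the asker's code: the names filter_lines and write_matches are made up for illustration, and it assumes fileB.txt fits in memory. Unlike the original loop, a line of fileA.txt is written once even if its field occurs several times in fileB.txt.

```python
import re

# Same pattern as in the question: capture the last comma-separated field,
# which must end in a dot followed by lowercase letters.
PATTERN = re.compile(r'^.+,.+,(.+\.[a-z]+)$')


def filter_lines(lines_a, lines_b):
    """Yield each line of lines_a whose captured field equals a line of lines_b.

    Hypothetical helper: takes any iterables of strings, so it works on
    open file objects as well as on plain lists.
    """
    wanted = {line.strip() for line in lines_b}  # built once; O(1) lookups
    for line in lines_a:
        match = PATTERN.match(line)
        if match and match.group(1) in wanted:
            yield line


def write_matches(path_a, path_b, path_c):
    """Hypothetical wrapper: write the matching lines of path_a to path_c."""
    with open(path_b) as infile_b, open(path_a) as infile_a, \
            open(path_c, 'w') as outfile:
        outfile.writelines(filter_lines(infile_a, infile_b))
```

Calling write_matches('fileA.txt', 'fileB.txt', 'fileC.txt') would then make a single pass over each file, instead of one pass over fileB.txt per line of fileA.txt.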

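It looks like your regex selects whole lines that match the pattern, i.e. it begins with ^ and ends with $. In that case, a more complete solution is to read fileA.txt and fileB.txt into arrays with readlines(), sort those arrays, and then make a single pass over both of them with two counters, for example: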

# Details regarding the treatment of duplicate lines are ignored
# for clarity of exposition.
rai = sorted([7, 6, 1, 9, 11, 6])
raj = sorted([4, 6, 11, 7])
i, j = 0, 0
while i < len(rai) and j < len(raj):
    if rai[i] < raj[j]:
        i += 1
    elif rai[i] > raj[j]:
        j += 1
    else:
        # I used a modulo test in lieu of testing for your regex
        # since you didn't supply data
        if rai[i] % 2:
            print(rai[i])
        i, j = i + 1, j + 1
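Applied to the actual files, the same two-counter pass might look roughly like the sketch below. It is an illustration under assumptions rather than a tested solution for your data: merge_filter is a made-up name, the regex is the one from the question, and the function takes lists of lines such as those returned by readlines(). Only i advances on a match, so that several lines of fileA.txt sharing the same field are all kept. Note that the result comes out in sorted order, not in the original order of fileA.txt.

```python
import re

# Pattern from the question: capture the last comma-separated field.
PATTERN = re.compile(r'^.+,.+,(.+\.[a-z]+)$')


def merge_filter(lines_a, lines_b):
    """Return the lines of lines_a whose captured field occurs in lines_b,
    found with one two-counter pass over sorted data (hypothetical helper)."""
    # Pair every matching fileA line with its captured key, then sort by key.
    keyed = []
    for line in lines_a:
        match = PATTERN.match(line)
        if match:
            keyed.append((match.group(1), line))
    keyed.sort()
    keys_b = sorted(line.strip() for line in lines_b)

    result = []
    i, j = 0, 0
    while i < len(keyed) and j < len(keys_b):
        key_a = keyed[i][0]
        if key_a < keys_b[j]:
            i += 1
        elif key_a > keys_b[j]:
            j += 1
        else:
            result.append(keyed[i][1])
            i += 1  # keep j in place so duplicate keys in A still match
    return result
```

For inputs of this size the sorting dominates the cost, so this is O(n log n) overall, compared with the nested-loop version that rescans fileB.txt for every line of fileA.txt.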

Solution 1
0 2015-02-12 07:05:46
