Python: Compare regex pattern in one text file against line in another

This works on smaller text files, but not on larger ones (100,000 lines). How can I optimize it for large text files? For each line in fileA, if the text captured by the regex equals a line in fileB, write the entire line from fileA to fileC.

import re

with open('fileC.txt', 'w') as outfile:
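    with open('fileA.txt', 'r') as infile1:
        for line1 in infile1:
            # Capture the trailing field that follows at least two commas.
            y = re.findall(r'^.+,.+,(.+\.[a-z]+$)', line1)
            if not y:
                continue  # skip lines the pattern does not match
            # fileB.txt is reopened and rescanned for every line of fileA.txt
            with open('fileB.txt', 'r') as infile2:
                for line2 in infile2:
                    if line2.strip() == y[0]:
                        outfile.write(line1)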

The most immediate optimization is to read fileB.txt only once into memory (a string buffer, for instance) and then test each matched expression against that in-memory copy. You are currently opening and reading that file once for each line of fileA.txt, so the total work grows with the product of the two files' line counts.
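A minimal sketch of the read-once idea follows. It swaps the string buffer for a set of stripped lines, which is my own substitution rather than something stated above, because set membership tests take roughly constant time. The function names filter_lines and filter_files are hypothetical; the pattern and file names are taken from the question.

```python
import re

# Same pattern as in the question, compiled once.
PATTERN = re.compile(r'^.+,.+,(.+\.[a-z]+$)')

def filter_lines(lines_a, lines_b):
    """Return the lines of A whose captured field equals a stripped line of B."""
    # Assumption: fileB fits in memory; it is read exactly once into a set.
    wanted = {line.strip() for line in lines_b}
    kept = []
    for line in lines_a:
        match = PATTERN.search(line)
        if match and match.group(1) in wanted:
            kept.append(line)
    return kept

def filter_files(path_a='fileA.txt', path_b='fileB.txt', path_c='fileC.txt'):
    """Hypothetical wrapper applying filter_lines to the three files."""
    with open(path_b) as infile2:
        lines_b = infile2.readlines()
    with open(path_a) as infile1, open(path_c, 'w') as outfile:
        outfile.writelines(filter_lines(infile1, lines_b))
```

One behavioural difference to be aware of: the original nested loop writes a line of fileA.txt once for every matching line in fileB.txt, so duplicates in fileB.txt produce duplicate output, whereas this sketch writes each matching line once.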

It seems that your regex picks up whole lines that match a pattern, i.e. it starts with ^ and ends with $ . In this case, a more complete solution would be to load both fileA.txt and fileB.txt into lists using readlines() , sort those lists (fileA.txt by the text the regex captures), then take a single pass through both with two counters, e.g.:

# Details regarding the treatment of duplicate lines are ignored
# for clarity of exposition.
rai = sorted([7, 6, 1, 9, 11, 6])
raj = sorted([4, 6, 11, 7])
i, j = 0, 0
while i < len(rai) and j < len(raj):
    if rai[i] < raj[j]:
        i += 1
    elif rai[i] > raj[j]:
        j += 1
    else:
        # I used a modulo test in lieu of testing for your regex,
        # since you didn't supply data.
        if rai[i] % 2:
            print(rai[i])
        i, j = i + 1, j + 1
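The same two-counter pass can be sketched on text lines instead of integers. This is only an illustration under assumptions: merge_filter is a hypothetical name, the pattern is the one from the question, and duplicate keys in fileB.txt are collapsed before the pass, which is one possible answer to the duplicate-handling question left open above.

```python
import re

PATTERN = re.compile(r'^.+,.+,(.+\.[a-z]+$)')

def merge_filter(lines_a, lines_b):
    """Sort both inputs by key, then walk them once with two counters."""
    # Pair each matching line of A with the text the regex captures.
    keyed_a = []
    for line in lines_a:
        match = PATTERN.search(line)
        if match:
            keyed_a.append((match.group(1), line))
    keyed_a.sort()
    # Assumption: duplicates in B are dropped before sorting.
    keys_b = sorted({line.strip() for line in lines_b})

    kept = []
    i, j = 0, 0
    while i < len(keyed_a) and j < len(keys_b):
        key, line = keyed_a[i]
        if key < keys_b[j]:
            i += 1
        elif key > keys_b[j]:
            j += 1
        else:
            kept.append(line)
            i += 1  # advance only i, so repeated keys in A all match
    return kept
```

Note that the result comes out ordered by key rather than in the original order of fileA.txt; if the original order matters, a lookup structure built from fileB.txt is likely the simpler choice.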
