
Python: efficient file I/O

What is the most efficient (fastest) way to simultaneously read in two large files and do some processing?

I have two files, a.txt and b.txt, each containing about a hundred thousand corresponding lines. My goal is to read in the two files and then do some processing on each line pair:

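def kernel():
    a_file = open('a.txt', 'r')
    b_file = open('b.txt', 'r')
    a_line = a_file.readline()
    b_line = b_file.readline()
    while a_line:
        process(a_line, b_line)  # process requiring both corresponding file lines
        # advance both files, otherwise the loop never ends
        a_line = a_file.readline()
        b_line = b_file.readline()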

I looked into xreadlines and readlines, but I'm wondering if I can do better. Speed is of paramount importance for this task.

Thank you.

The code below does not accumulate data from the input files in memory, unless the process function does that itself.

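# Python 3: the built-in zip is lazy. On Python 2, use itertools.izip instead.
# file1 and file2 are the paths of the two input files.

def process(line1, line2):
  # process a line from each input
  pass

with open(file1, 'r') as f1, open(file2, 'r') as f2:
  for a, b in zip(f1, f2):
    process(a, b)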

If the process function is efficient, this code should run quickly enough for most purposes. The for loop terminates as soon as the end of either file is reached, so any extra lines in the longer file are silently ignored. If either file contains an extraordinarily long line (e.g. XML or JSON), or if the files are not text, this code may not work well.
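Since the loop stops quietly at the end of the shorter file, a mismatch in line counts can go unnoticed. Below is a minimal sketch of one way to detect it, assuming Python 3 and that a length mismatch should be treated as an error. The helper name paired_lines and the _MISSING marker are made up for illustration. On Python 3.10 or later, zip(f1, f2, strict=True) is another option.

```python
from itertools import zip_longest

_MISSING = object()  # hypothetical marker for "this file ran out of lines"

def paired_lines(path_a, path_b):
    """Lazily yield (a_line, b_line) pairs; raise ValueError if one file is longer."""
    with open(path_a) as f1, open(path_b) as f2:
        pairs = zip_longest(f1, f2, fillvalue=_MISSING)
        for lineno, (a, b) in enumerate(pairs, start=1):
            if a is _MISSING or b is _MISSING:
                raise ValueError("files differ in length at line %d" % lineno)
            yield a, b
```

It would be used as "for a, b in paired_lines('a.txt', 'b.txt'): process(a, b)", and it still reads only one line from each file at a time.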

You can use a with statement to make sure your files are closed after execution. From this blog entry:

to open a file, process its contents, and make sure to close it, you can simply do:

with open("x.txt") as f:
    data = f.read()
    # do something with data

String IO can be pretty fast -- probably your processing will be what slows things down. Consider a simple input loop to feed a queue like:

import multiprocessing

queue = multiprocessing.Queue(100)
with open('a.txt') as a_file, open('b.txt') as b_file:
    for pair in zip(a_file, b_file):  # itertools.izip on Python 2
        queue.put(pair)  # blocks here on full queue

You can set up a pool of processes pulling items from the queue and taking action on each, assuming your problem can be parallelised this way.
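The consumer side might look roughly like the sketch below. It assumes Python 3, and the names process, worker, run and SENTINEL are invented for illustration. The body of process is only a placeholder for the real per-pair work. Results come back in whatever order the workers finish, not in file order.

```python
import multiprocessing

SENTINEL = None  # hypothetical end-of-input marker, one is sent per worker

def process(a_line, b_line):
    # Placeholder for the real work (an assumption): combined line length.
    return len(a_line.rstrip("\n")) + len(b_line.rstrip("\n"))

def worker(queue, results):
    # Pull pairs off the queue until the sentinel arrives.
    while True:
        pair = queue.get()
        if pair is SENTINEL:
            break
        results.put(process(*pair))

def run(path_a, path_b, n_workers=4):
    queue = multiprocessing.Queue(100)
    results = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(queue, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    n_pairs = 0
    with open(path_a) as a_file, open(path_b) as b_file:
        for pair in zip(a_file, b_file):
            queue.put(pair)  # blocks here on full queue
            n_pairs += 1
    for _ in workers:
        queue.put(SENTINEL)  # tell each worker to stop
    # Drain the results before joining so no worker is left holding unsent data.
    out = [results.get() for _ in range(n_pairs)]
    for w in workers:
        w.join()
    return out
```

On platforms that start child processes by spawning a fresh interpreter, run() should be called from under an "if __name__ == '__main__':" guard. Each pair is pickled and sent to another process, so this only pays off if process does substantially more work than that transfer costs.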

I'd change your while condition to the following so that it doesn't fail when a has more lines than b.

while a_line and b_line:

Otherwise, that looks good. You are reading in the two lines that you need, then processing. You could even multithread this by reading in N pairs of lines and sending each batch off to a new thread or similar.
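A rough sketch of the batching idea follows, assuming Python 3 and a thread pool from concurrent.futures. The names batches, process_batch and run are hypothetical, and process_batch is a stand-in for the real work. Keep in mind that in CPython the GIL prevents threads from running Python bytecode in parallel, so this mainly helps when the processing waits on I/O or calls code that releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batches(pairs, n):
    """Yield lists of up to n items taken from the iterable pairs."""
    it = iter(pairs)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

def process_batch(batch):
    # Placeholder for the real work (an assumption): combined length of each pair.
    return [len(a) + len(b) for a, b in batch]

def run(path_a, path_b, n=1000, max_workers=4):
    results = []
    with open(path_a) as a_file, open(path_b) as b_file:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map submits every batch up front, so all lines are
            # read into memory here; results come back in input order.
            for chunk in pool.map(process_batch, batches(zip(a_file, b_file), n)):
                results.extend(chunk)
    return results
```

Sending whole batches rather than single pairs keeps the per-task overhead small relative to the work done.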
