
Merge multiple big sorted files in Python

The task is to merge two big sorted files that cannot fit in memory. After a bit of research, it seems pretty easy to do with heapq.merge:

import heapq
import contextlib

filenames=('data1.txt', 'data2.txt')
with contextlib.ExitStack() as stack:
    files = [stack.enter_context(open(fn)) for fn in filenames]
    with open('data', 'w') as f:
        f.writelines(heapq.merge(*files))

The problem is how to handle the empty lines in the files. For example:

Data1.txt:

apple

amazon

google

Data2.txt:

hello

today

world

Output:

apple
amazon
google
hello
today
world

My solution without heapq.merge:

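def read_non_empty_line(input_file):
    """Return the next non-blank line, stripped, or "" at end of file."""
    while True:
        line = input_file.readline()
        if line == "":  # readline() returns "" only at end of file
            return ""
        if not line.isspace():
            return line.strip()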

def combine_sorted_files(file1, file2, output):

    read_file1, read_file2 = True, True

    with open(output,'w') as output_file:
        with open(file1,'r') as input_file1:
            with open(file2,'r') as input_file2:
                while True:
                    if read_file1:
                        line1 = read_non_empty_line(input_file1)
                    if read_file2:
                        line2 = read_non_empty_line(input_file2)

                    if line1 == "" or line2 == "":
                        break

                    read_file1, read_file2 = False, False
                    if line1 < line2:
                        smaller = line1
                        read_file1 = True
                    else:
                        smaller = line2
                        read_file2 = True

                    output_file.write(smaller+"\n\n")

                while line1 != "":
                    output_file.write(line1+"\n\n")
                    line1 = read_non_empty_line(input_file1)
                while line2 != "":
                    output_file.write(line2+"\n\n")
                    line2 = read_non_empty_line(input_file2)

The problem also asks for optimizing both memory and CPU usage. Are there any suggestions?

If you want to use heapq.merge while skipping blank lines, you can create your own generator function to handle the skip logic:

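def iterate_non_blank_lines(file_iterator):
    for line in file_iterator:
        # Lines read from a file keep their trailing newline, so a blank
        # line is "\n", not "". strip() reduces it to "", which is falsy.
        if line.strip():
            yield line

Note: this skips lines that are empty or contain only whitespace. If you need a different definition of "blank", you could use a regular expression here instead.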

Then your code could be modified to use this generator:

filenames=('data1.txt', 'data2.txt')
with contextlib.ExitStack() as stack:
    files = [iterate_non_blank_lines(stack.enter_context(open(fn))) for fn in filenames]
    with open('data', 'w') as f:
        f.writelines(heapq.merge(*files))
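
One thing to watch for: if the last line of an input file has no trailing newline, it gets glued to the next line in the merged output. Below is a minimal sketch of a variant that normalizes every line to end with a single newline. The names non_blank_lines and merge_sorted_streams are made up for this example, the sample data is invented, and io.StringIO stands in for real files so the snippet runs on its own. It assumes that stripping surrounding whitespace does not change the sort order of your data.

```python
import heapq
import io


def non_blank_lines(lines):
    """Yield each non-blank line, normalized to end with exactly one newline."""
    for line in lines:
        stripped = line.strip()
        if stripped:
            yield stripped + "\n"


def merge_sorted_streams(streams, out):
    """Lazily merge already-sorted line streams into `out`."""
    # heapq.merge returns an iterator, so only one pending line per
    # input stream is held in memory at any time.
    out.writelines(heapq.merge(*(non_blank_lines(s) for s in streams)))


# Invented sample data; each stream must already be sorted.
first = io.StringIO("amazon\n\napple\n\ngoogle")      # no trailing newline
second = io.StringIO("hello\n   \ntoday\n\nworld\n")  # whitespace-only line
merged = io.StringIO()
merge_sorted_streams([first, second], merged)
result = merged.getvalue()
```

With real files you would pass the open file objects from the ExitStack instead of the StringIO objects.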

Also, this question sounds a lot like a homework problem (apologies if it's not), and I would highly recommend implementing the merge logic yourself because it is a fun problem.
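
If you want a starting point for writing the merge yourself, here is one possible sketch of a k-way merge built on a heap. It is not how heapq.merge is implemented internally, just an illustration of the idea; k_way_merge and _DONE are names invented for this example. It keeps one pending item per input, so memory use grows with the number of files rather than their size, and each output item costs O(log k) heap work for k inputs.

```python
import heapq

_DONE = object()  # sentinel marking an exhausted iterator


def k_way_merge(*iterables):
    """Lazily merge already-sorted iterables into one sorted stream."""
    heap = []
    for index, iterable in enumerate(iterables):
        iterator = iter(iterable)
        first = next(iterator, _DONE)
        if first is not _DONE:
            # index is unique per input, so ties on the value are broken
            # by index and the iterator itself is never compared.
            heap.append((first, index, iterator))
    heapq.heapify(heap)

    while heap:
        value, index, iterator = heap[0]  # smallest pending item
        yield value
        following = next(iterator, _DONE)
        if following is _DONE:
            heapq.heappop(heap)  # this input is finished
        else:
            # Swap the emitted item for the next one from the same input.
            heapq.heapreplace(heap, (following, index, iterator))


merged_numbers = list(k_way_merge([1, 4, 7], [2, 5], [], [3, 6, 8]))
```

The same function works on the line generators above, since it only needs iterables whose items can be compared.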
