The task is to merge two big sorted files (too large to fit in memory). After a little research, it seems pretty easy to do with heapq.merge:
import heapq
import contextlib

filenames = ('data1.txt', 'data2.txt')
with contextlib.ExitStack() as stack:
    files = [stack.enter_context(open(fn)) for fn in filenames]
    with open('data', 'w') as f:
        f.writelines(heapq.merge(*files))
The problem is how to handle the empty lines in the files. For example:
Data1.txt:
apple
amazon
Data2.txt:
hello
today
world
Output:
apple
amazon
google
hello
today
world
My solution without heapq.merge:
def read_non_empty_line(input_file):
    """Return the next non-blank line, stripped, or "" at end of file."""
    while True:
        line = input_file.readline()
        if line == "":  # readline() returns "" only at end of file
            return ""
        if not line.isspace():
            return line.strip()

def combine_sorted_files(file1, file2, output):
    read_file1, read_file2 = True, True
    with open(output, 'w') as output_file:
        with open(file1, 'r') as input_file1:
            with open(file2, 'r') as input_file2:
                while True:
                    # Only advance the file whose line was written last time.
                    if read_file1:
                        line1 = read_non_empty_line(input_file1)
                    if read_file2:
                        line2 = read_non_empty_line(input_file2)
                    if line1 == "" or line2 == "":
                        break
                    read_file1, read_file2 = False, False
                    if line1 < line2:
                        smaller = line1
                        read_file1 = True
                    else:
                        smaller = line2
                        read_file2 = True
                    output_file.write(smaller + "\n\n")
                # One file is exhausted; copy whatever remains in the other.
                while line1 != "":
                    output_file.write(line1 + "\n\n")
                    line1 = read_non_empty_line(input_file1)
                while line2 != "":
                    output_file.write(line2 + "\n\n")
                    line2 = read_non_empty_line(input_file2)
The problem also asks to optimize both memory and CPU usage. Are there any suggestions?
If you want to use heapq.merge while skipping blank lines, you can create your own generator function to handle the skip logic:
def iterate_non_blank_lines(file_iterator):
    for line in file_iterator:
        # Lines read from a file keep their newline, so a blank line is "\n", not "".
        if line != "\n":
            yield line
Note: I have simply checked for blank lines, but you could just as easily skip lines that contain only whitespace, for example with line.isspace() or a regular expression.
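As a minimal sketch of the whitespace-skipping variation mentioned in the note, the generator below drops lines that are empty or contain only whitespace. The name `iterate_non_whitespace_lines` is made up for illustration, and `io.StringIO` stands in for a real file so the snippet runs on its own.

```python
import io


def iterate_non_whitespace_lines(file_iterator):
    """Yield only lines that contain at least one non-whitespace character."""
    for line in file_iterator:
        # str.strip() returns "" for "\n", "   \n", "\t\n", and so on.
        if line.strip():
            yield line


# io.StringIO stands in for an open text file here.
sample = io.StringIO("apple\n\n   \namazon\n\t\n")
kept = list(iterate_non_whitespace_lines(sample))
print(kept)
```

No regular expression is needed for this case, since `str.strip()` already treats spaces, tabs, and newlines as whitespace.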
Then your code could be modified to use this generator:
filenames = ('data1.txt', 'data2.txt')
with contextlib.ExitStack() as stack:
    files = [iterate_non_blank_lines(stack.enter_context(open(fn))) for fn in filenames]
    with open('data', 'w') as f:
        f.writelines(heapq.merge(*files))
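To try the approach without creating files, here is a small self-contained sketch that uses `io.StringIO` objects in place of data1.txt and data2.txt. The helper name `clean_lines` and the sample contents are assumptions for illustration. It also normalizes the trailing newline, because the last line of a file may not end in one and would otherwise be glued to the next entry by `writelines`.

```python
import heapq
import io


def clean_lines(file_iterator):
    """Yield non-blank lines, each ending in exactly one newline."""
    for line in file_iterator:
        if line.strip():
            # The last line of a file may lack a trailing newline; normalize it
            # so entries are not glued together in the output.
            yield line.rstrip("\n") + "\n"


# In-memory stand-ins for data1.txt and data2.txt (each already sorted).
data1 = io.StringIO("amazon\n\napple\n\ngoogle")
data2 = io.StringIO("hello\n\ntoday\n\nworld\n")

out = io.StringIO()
out.writelines(heapq.merge(clean_lines(data1), clean_lines(data2)))
merged = out.getvalue()
print(merged, end="")
```

On the memory question: `heapq.merge` returns a lazy iterator and pulls one item at a time from each input, so only about one line per file is held in memory at once. It does assume that every input is already sorted.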
Also, this question sounds a lot like a homework problem (apologies if it's not), and I would highly recommend implementing the merge logic yourself because it is a fun problem.
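If you do go that route, one possible shape for a hand-written merge is a generator over two iterators. This is only a sketch under the assumption that both inputs are already sorted; `merge_two` is a hypothetical name, not a library function.

```python
def merge_two(lines_a, lines_b):
    """Lazily merge two sorted iterables, holding one item from each at a time."""
    it_a, it_b = iter(lines_a), iter(lines_b)
    sentinel = object()  # marks an exhausted input
    a, b = next(it_a, sentinel), next(it_b, sentinel)
    while a is not sentinel and b is not sentinel:
        if b < a:
            yield b
            b = next(it_b, sentinel)
        else:
            # Ties take from the first input, which keeps the merge stable.
            yield a
            a = next(it_a, sentinel)
    # At most one input still has items; emit the held item, then the rest.
    if a is not sentinel:
        yield a
        yield from it_a
    if b is not sentinel:
        yield b
        yield from it_b


result = list(merge_two(["amazon\n", "apple\n"], ["google\n", "hello\n"]))
print(result)
```

Because it is a generator, it can be passed straight to `f.writelines(...)` and wrapped around the blank-line-skipping generator in the same way as `heapq.merge`.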