
Implementing an external merge sort

I'm trying to learn Python and am working on an external merge sort that takes an input file of ints. I'm using heapq.merge, and my code almost works, but it seems to be sorting my lines as strings instead of ints. If I try to convert to ints, writelines won't accept the data. Can anyone help me find an alternative? Additionally, am I correct in thinking this will allow me to sort a file bigger than memory (given adequate disk space)?

import itertools
from itertools import islice
import tempfile
import heapq

#converts heapq.merge to ints
#def merge(*temp_files):
#    return heapq.merge(*[itertools.imap(int, s) for s in temp_files])

with open("path\to\input", "r") as f:
    temp_file = tempfile.TemporaryFile()
    temp_files = []
    elements = []
    while True:
        elements = list(islice(f, 1000))
        if not elements:
           break
        elements.sort(key=int)
        temp_files.append(elements)
        temp_file.writelines(elements)
        temp_file.flush()
        temp_file.seek(0)
        with open("path\to\output", "w") as output_file:
            output_file.writelines(heapq.merge(*temp_files))

Your elements are read as strings by default, so you have to convert them, for example:

elements = list(islice(f, 1000))
elements = [int(elem) for elem in elements]

so that they would be interpreted as integers instead.
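To see why the conversion matters, here is a minimal sketch with made-up sample lines (the values are purely illustrative) comparing a string sort with a numeric sort:

```python
# Hypothetical sample lines, as they would be read from a text file.
lines = ["10\n", "9\n", "2\n"]

# Lexicographic comparison: "10\n" sorts before "2\n" because "1" < "2".
as_strings = sorted(lines)

# Numeric comparison: int() tolerates the trailing newline.
as_ints = sorted(int(line) for line in lines)

print(as_strings)  # ['10\n', '2\n', '9\n']
print(as_ints)     # [2, 9, 10]
```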

That also means you need to convert them back to strings when writing, and add back the newline that int() discarded, e.g.:

temp_file.writelines('{}\n'.format(elem) for elem in elements)

Apart from that, you need to convert the elements to int again for the final merge. In your case, you probably want to uncomment your merge function (and then convert the result back to strings, the same way as above).

Your code doesn't make much sense to me (temp_files.append(elements)? Merging inside the loop?), but here's a way to merge files, sorting numerically:

import heapq
files = open('a.txt'), open('b.txt')
with open('merged.txt', 'w') as out:
    out.writelines(map('{}\n'.format,
                       heapq.merge(*(map(int, f)
                                     for f in files))))

First, map(int, ...) turns each file's lines into ints. Those get merged with heapq.merge. Then map('{}\n'.format, ...) turns each integer back into a string with a trailing newline, and writelines writes those lines. In other words, you were already close; you just had to convert the ints back to strings before writing them.

A different way to write it (might be clearer for some):

import heapq
files = open('a.txt'), open('b.txt')
with open('merged.txt', 'w') as out:
    int_streams = (map(int, f) for f in files)
    int_stream = heapq.merge(*int_streams)
    line_stream = map('{}\n'.format, int_stream)
    out.writelines(line_stream)

And in any case, use itertools.imap if you're on Python 2, as otherwise map will read each whole file into memory at once. In Python 3, you can just use the normal map.

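And yes, if you do it right, this will allow you to sort gigantic files with very little memory: only one chunk is held in memory while sorting, and only about one line per temporary file during the merge.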
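Putting the pieces together, here is a rough sketch of the whole thing for Python 3. The function name external_sort and its parameters are my own invention, and it assumes the input has exactly one integer per line with no blank lines. It writes one temporary file per sorted chunk and merges them all once, after the chunking loop has finished:

```python
import heapq
import tempfile
from itertools import islice


def external_sort(input_path, output_path, chunk_size=1000):
    """Sort a file of one integer per line, one chunk in memory at a time.

    Hypothetical helper for illustration; assumes no blank lines.
    """
    runs = []
    try:
        # Phase 1: split the input into sorted runs, one temp file per run.
        with open(input_path) as src:
            while True:
                chunk = [int(line) for line in islice(src, chunk_size)]
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile(mode="w+")  # text mode
                run.writelines("{}\n".format(n) for n in chunk)
                run.seek(0)  # rewind so the merge can read it back
                runs.append(run)

        # Phase 2: a single k-way merge of all runs, after the loop.
        with open(output_path, "w") as out:
            int_streams = (map(int, run) for run in runs)
            merged = heapq.merge(*int_streams)
            out.writelines("{}\n".format(n) for n in merged)
    finally:
        # Closing a TemporaryFile also deletes it.
        for run in runs:
            run.close()
```

You would call it as external_sort("input.txt", "output.txt") with your own paths. Note that every run stays open during the merge, so with a very large input and a small chunk_size you could hit the operating system's limit on open files; a bigger chunk_size keeps the number of runs down.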

You are doing the k-way merge inside the loop, which adds a lot of runtime complexity. It is better to store the file handles in a separate list and perform a single k-way merge after the loop.

You also don't have to strip the newline and add it back; just sort the lines by their numeric value.

sorted(temp_files,key=lambda no:int(no.strip()))

The rest is fine.
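The same trick should work for the merge step too. Assuming Python 3.5 or later, heapq.merge accepts a key argument, so the lines can stay as strings the whole way through. A small sketch with made-up runs standing in for the temporary files:

```python
import heapq

# Two already-sorted runs of raw lines (hypothetical sample data),
# standing in for two open temporary files.
run_a = ["2\n", "10\n", "30\n"]
run_b = ["9\n", "11\n"]

# key=int compares the lines numerically but yields the original strings,
# so the result can be passed straight to writelines().
merged = list(heapq.merge(run_a, run_b, key=int))

print(merged)  # ['2\n', '9\n', '10\n', '11\n', '30\n']
```

Each input still has to be sorted numerically beforehand, since heapq.merge only merges and never re-sorts.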

https://github.com/melvilgit/external-Merge-Sort/blob/master/README.md

