
Sort a big file with Python heapq.merge

I'm trying to complete the following task but have run into difficulty:

I have a huge text file. Each line has the format "AGTCCCGGAT filename", where the first part is a DNA sequence.

The professor suggests that we break this huge file into many temporary files and use heapq.merge() to sort them. The goal is to end up with a single file that contains every line of the original file, sorted.

My first try was to put each line into a separate temporary file. The problem is that heapq.merge() then reports there are too many files to sort.

My second try was to split it into temporary files of 50,000 lines each. The problem is that the result seems to be sorted by file, not by line. For example, we get something like:

ACGTACGT filename
CGTACGTA filename
ACGTCCGT filename
CGTAAAAA filename

where the first two lines are from one temporary file and the last two lines are from the second.

My code to sort them is as follows:

import heapq, os

result = open('result.txt', 'w')  # output file; was never opened in the original snippet
for line in heapq.merge(*[open('/var/tmp/L._Ipsum-strain01.fa_dir/' + f, 'r') for f in os.listdir('/var/tmp/L._Ipsum-strain01.fa_dir')]):
    result.write(line)
result.close()
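The symptom can be reproduced in miniature with two in-memory chunks standing in for the temporary files: heapq.merge() only ever compares the current head of each input, so it yields a globally sorted result only when every input is itself already sorted.

```python
from heapq import merge

# Two chunks standing in for the temporary files; chunk1 is NOT sorted.
chunk1 = ['CGTACGTA\n', 'ACGTACGT\n']
chunk2 = ['ACGTCCGT\n', 'CGTAAAAA\n']

# merge() only compares the current head of each input, so unsorted
# chunks come out interleaved rather than globally sorted.
interleaved = list(merge(chunk1, chunk2))

# Sorting each chunk first makes merge() produce a fully sorted stream.
merged = list(merge(sorted(chunk1), sorted(chunk2)))
```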

Your solution is almost correct. However, each partial file must be sorted before you write it to disk. Here's a 2-pass algorithm that demonstrates this: first, iterate over the input file in 50k-line chunks, sort the lines within each chunk, and write each sorted chunk to its own file. In the second pass, open all these files and merge them into the output file.

from heapq import merge
from itertools import count, islice
from contextlib import ExitStack  # not available on Python 2
                                  # need to care for closing files otherwise

chunk_names = []

# chunk and sort
with open('input.txt') as input_file:
    for chunk_number in count(1):
        # read in next 50k lines and sort them
        sorted_chunk = sorted(islice(input_file, 50000))
        if not sorted_chunk:
            # end of input
            break

        chunk_name = 'chunk_{}.chk'.format(chunk_number)
        chunk_names.append(chunk_name)
        with open(chunk_name, 'w') as chunk_file:
            chunk_file.writelines(sorted_chunk)

with ExitStack() as stack, open('output.txt', 'w') as output_file:
    files = [stack.enter_context(open(chunk)) for chunk in chunk_names]
    output_file.writelines(merge(*files))
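As a side note, if the sort should compare only the DNA field and ignore the filename, both sorted() and merge() accept a key function (merge() gained key= in Python 3.5). A small sketch with invented sample lines:

```python
from heapq import merge

# Sort/merge by the DNA token only (first whitespace-separated field);
# the sample lines below are made up for illustration.
dna_key = lambda line: line.split()[0]

chunk_a = sorted(['CCCC file1\n', 'AAAA file2\n'], key=dna_key)
chunk_b = sorted(['GGGG file4\n', 'AAAC file3\n'], key=dna_key)
merged = list(merge(chunk_a, chunk_b, key=dna_key))
```

For lines in the "SEQUENCE filename" format this usually matches a plain lexicographic sort, since the sequence comes first on each line.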
