
How to optimize merge sort?

I have two files of 1 GB each, containing only numbers in sorted order. I know how to read the contents of the files, sort them using the merge sort algorithm, and output the result to another file, but what I'm interested in is how to do this using only a 100 MB buffer (I'm not worried about the scratch space). For example, one way is to read 50 MB chunks from both files, and as the data is merged, read new elements and continue the process until I reach the end of both files. Can anyone give me an idea of how to implement this?

Sounds like you only need to merge the numbers in your files, not sort them, since they're already sorted in each file. The merge part of merge sort is this:

function merge(left,right)
    var list result
    while length(left) > 0 or length(right) > 0
        if length(left) > 0 and length(right) > 0
            if first(left) ≤ first(right)
                append first(left) to result
                left = rest(left)
            else
                append first(right) to result
                right = rest(right)
        else if length(left) > 0
            append left to result
            break             
        else if length(right) > 0
            append right to result
            break
    end while
    return result

Now you can just read the first 50 MB of numbers from both files into two buffers, apply the merge algorithm, and then, when one of the buffers has been exhausted (all its numbers consumed), read another 50 MB from the corresponding file. There's no need to sort anything.

You just need a condition that checks when one of your buffers is empty. When it is, read more from the file that buffer is associated with.
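A minimal sketch of that refill-as-you-go loop might look like the following (the file names, the 50 MB figure, and the choice of 64-bit integers read as text are all assumptions for illustration): each input file gets its own buffer, the usual merge step takes the smaller front element, and a buffer is refilled from its file the moment it runs dry.

#include <fstream>
#include <vector>
#include <cstddef>

// Read up to max_count numbers from `in` into `buf`; returns how many were read.
static std::size_t refill(std::ifstream& in, std::vector<long long>& buf, std::size_t max_count)
{
    buf.clear();
    long long v;
    while (buf.size() < max_count && in >> v)
        buf.push_back(v);
    return buf.size();
}

int main()
{
    // ~50 MB worth of 8-byte integers per input buffer (illustrative figure).
    const std::size_t kCount = 50u * 1024 * 1024 / sizeof(long long);

    std::ifstream a("in1.txt"), b("in2.txt");   // placeholder file names
    std::ofstream out("merged.txt");

    std::vector<long long> bufA, bufB;
    std::size_t ia = 0, ib = 0;
    refill(a, bufA, kCount);
    refill(b, bufB, kCount);

    while (ia < bufA.size() || ib < bufB.size()) {
        // Take from whichever side currently has the smaller front element.
        bool takeA;
        if      (ia == bufA.size()) takeA = false;
        else if (ib == bufB.size()) takeA = true;
        else                        takeA = bufA[ia] <= bufB[ib];

        out << (takeA ? bufA[ia++] : bufB[ib++]) << '\n';

        // A buffer that has been fully consumed gets refilled from its own file.
        if (ia == bufA.size() && a) { refill(a, bufA, kCount); ia = 0; }
        if (ib == bufB.size() && b) { refill(b, bufB, kCount); ib = 0; }
    }
}

The output side here just relies on the ofstream's own buffering; an explicit output buffer would follow the same pattern.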

Why not utilize the standard library?

#include <fstream>
#include <iterator>
#include <algorithm>

int main()
{
   std::ifstream in1("in1.txt");
   std::ifstream in2("in2.txt");
   std::ofstream ut("ut.txt");
   std::istream_iterator<int> in1_it(in1);
   std::istream_iterator<int> in2_it(in2);
   std::istream_iterator<int> in_end;   // default-constructed: end-of-stream marker
   std::ostream_iterator<int> ut_it(ut, "\n");

   // std::merge walks both already-sorted inputs once and writes the combined
   // sorted sequence straight to the output file.
   std::merge(in1_it, in_end, in2_it, in_end, ut_it);
}

You probably want to read/write in reasonably sized chunks to avoid I/O overhead, so use three buffers of roughly 30 MB each: input1, input2, and output.

Keep going until either one of the input buffers is empty or the output buffer is full, then read/write to refill/empty the empty/full buffer.

That way you are writing/reading large chunks of data from the disk.

Beyond that you need asynchronous I/O to read/write data while you are doing the sorting. But that's probably overkill.
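If you want to keep the std::merge version above, one way to approach those larger chunks without hand-rolling the buffers is to hand each stream a big buffer before opening its file. This is only a sketch: whether a filebuf actually uses a caller-supplied buffer is implementation-defined, and the ~30 MB figure simply echoes the suggestion above.

#include <fstream>
#include <iterator>
#include <algorithm>
#include <vector>
#include <cstddef>

int main()
{
    const std::size_t kBufSize = 30u * 1024 * 1024;   // ~30 MB per stream
    std::vector<char> b1(kBufSize), b2(kBufSize), b3(kBufSize);

    std::ifstream in1, in2;
    std::ofstream ut;
    // pubsetbuf must be called before open() to have any chance of taking effect.
    in1.rdbuf()->pubsetbuf(b1.data(), static_cast<std::streamsize>(b1.size()));
    in2.rdbuf()->pubsetbuf(b2.data(), static_cast<std::streamsize>(b2.size()));
    ut.rdbuf()->pubsetbuf(b3.data(), static_cast<std::streamsize>(b3.size()));
    in1.open("in1.txt");
    in2.open("in2.txt");
    ut.open("ut.txt");

    std::istream_iterator<int> in1_it(in1), in2_it(in2), in_end;
    std::ostream_iterator<int> ut_it(ut, "\n");
    std::merge(in1_it, in_end, in2_it, in_end, ut_it);
}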

Since you're only doing a merge, not a complete sort, it's just the basic merge loop. Purely sequential I/O. No need to worry about buffers. Picture a zipper on a jacket. It's that simple. (Note: it could be a lot faster if the numbers are in binary format in the files. Not only will the files be smaller, but the program will be I/O limited, and the numbers will be perfectly accurate.)

#include <stdio.h>
#include <float.h>

#define BIGBIGNUMBER DBL_MAX   /* sentinel: larger than anything in the files */

double GetNumberFromFile(FILE *file){
  double value;
  /* fscanf reports end-of-file (or a parse error) via its return value,
     which avoids the feof-before-read pitfall */
  if (fscanf(file, "%lf", &value) != 1){
    return BIGBIGNUMBER;
  }
  return value;
}

int main(void){
  /* placeholder file names; the inputs hold whitespace-separated numbers */
  FILE *AFILE = fopen("a.txt", "r");
  FILE *BFILE = fopen("b.txt", "r");
  FILE *OUT   = fopen("merged.txt", "w");

  double A = GetNumberFromFile(AFILE);
  double B = GetNumberFromFile(BFILE);
  while (A < BIGBIGNUMBER && B < BIGBIGNUMBER){
    if (A < B){
      fprintf(OUT, "%g\n", A);
      A = GetNumberFromFile(AFILE);
    }
    else if (B < A){
      fprintf(OUT, "%g\n", B);
      B = GetNumberFromFile(BFILE);
    }
    else {
      fprintf(OUT, "%g\n", A);
      fprintf(OUT, "%g\n", B);   /* or drop this line to eliminate duplicates */
      A = GetNumberFromFile(AFILE);
      B = GetNumberFromFile(BFILE);
    }
  }
  /* drain whichever file still has numbers left */
  while (A < BIGBIGNUMBER){
    fprintf(OUT, "%g\n", A);
    A = GetNumberFromFile(AFILE);
  }
  while (B < BIGBIGNUMBER){
    fprintf(OUT, "%g\n", B);
    B = GetNumberFromFile(BFILE);
  }

  fclose(AFILE);
  fclose(BFILE);
  fclose(OUT);
  return 0;
}

Responding to your question, consider a simpler problem: copying one file to another. You're only doing sequential I/O, which the file system is really good at. You write a simple loop to read small units, like a byte or an int, from one file and write them to the other. As soon as you try to read a byte, the system allocates a nice big buffer, swipes a big chunk of the file into the buffer, and then feeds you the byte out of the buffer. It keeps doing that until you need another buffer, when it invisibly gloms another one for you. The same sort of thing happens with the file you are writing. Now the CPU is pretty quick, so it can iterate through the input bytes, copying them to the output, in a fraction of the time it takes to read or write a buffer, because the reading or writing can't go any faster than the external hardware. The only reason a larger buffer would help is that part of the reading/writing time is what's called "latency", basically the time it takes to move the head to the desired track and wait for the desired sector to come around. Most file systems break up the files into chunks that are sprinkled around the disk, so the head is jumping anyway. You can hear it.

The only difference between copying and a merge algorithm like yours is that it reads two files, not one. Either way, the basic time sequence is a series of buffer reads and writes interspersed with a small amount of CPU action. (It is possible to do overlapped I/O, so that the CPU action takes place while the I/O happens and there is basically no delay between buffer reads and writes, but that was a bigger deal when CPUs were 1000 times slower.)

Of course, if you can arrange it so that the files being read and written are all on separate physical disk drives, and the drives are not fragmented much, then the amount of head motion could be minimized, and larger buffers might help. But basically, with a simple program, you can pretty much expect the simple code to go about as fast as the disk can move data, and giant buffers might help, but not much.

Benchmark it. Read value-by-value versus in blocks. Feel the difference! =)
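A quick way to feel that difference is a throwaway timing harness like the sketch below (the file name and block size are placeholders); it reads the same file once character by character and once in 1 MB blocks and prints both timings:

#include <chrono>
#include <cstdio>
#include <vector>

// Run `fn` once and report the elapsed wall-clock time in milliseconds.
template <typename F>
static double time_ms(F fn)
{
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const char* path = "in1.txt";              // placeholder input file

    double char_ms = time_ms([&] {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return;
        while (std::fgetc(f) != EOF) {}        // value-by-value
        std::fclose(f);
    });

    double block_ms = time_ms([&] {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return;
        std::vector<char> buf(1 << 20);        // 1 MB block reads
        while (std::fread(buf.data(), 1, buf.size(), f) == buf.size()) {}
        std::fclose(f);
    });

    std::printf("char-by-char: %.1f ms, block: %.1f ms\n", char_ms, block_ms);
}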
