How to optimize merge sort?

I have two files of 1 GB each, containing only numbers in sorted order. I know how to read the contents of the files, combine them with the merge sort algorithm, and write the output to another file, but what I'm interested in is how to do this using only a 100 MB buffer (I'm not worried about the scratch space). For example, one way is to read 50 MB chunks from both files and merge them, and since everything is already sorted I can keep reading new elements and continue the process until I reach the end of both files. Can anyone give me an idea of how to implement this?

Sounds like you only need to merge the numbers in your files, not sort them, since they're already sorted in each file. The merge part of merge sort is this:

function merge(left,right)
    var list result
    while length(left) > 0 or length(right) > 0
        if length(left) > 0 and length(right) > 0
            if first(left) ≤ first(right)
                append first(left) to result
                left = rest(left)
            else
                append first(right) to result
                right = rest(right)
        else if length(left) > 0
            append left to result
            break             
        else if length(right) > 0
            append right to result
            break
    end while
    return result

Now you can just read the first 50 MB of numbers from both files into two buffers, apply the merge algorithm, and then, when one of the buffers has been exhausted (all its numbers consumed), read another 50 MB from the corresponding file. There's no need to sort anything.

You just need a condition that checks when one of your buffers is empty. When it is, read more from the file that buffer is associated with.
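Here is a minimal sketch of that refill-on-empty merge, assuming the files hold whitespace-separated integers; the file names and the ~50 MB chunk size are placeholders, not part of the question. The two in-memory buffers together stay within the 100 MB budget, and the output goes straight to the stream.

// Minimal sketch of the refill-on-empty merge described above. It assumes the
// files hold whitespace-separated integers; the file names and the ~50 MB
// chunk size are placeholders.
#include <cstddef>
#include <fstream>
#include <vector>

// Refill `buf` with up to `count` numbers from `in`; returns false once the
// file is exhausted and nothing could be read.
static bool refill(std::ifstream &in, std::vector<long long> &buf, std::size_t count)
{
    buf.clear();
    long long v;
    while (buf.size() < count && in >> v)
        buf.push_back(v);
    return !buf.empty();
}

int main()
{
    const std::size_t kChunk = 50u * 1024 * 1024 / sizeof(long long); // ~50 MB of values
    std::ifstream a("in1.txt"), b("in2.txt");
    std::ofstream out("out.txt");

    std::vector<long long> bufA, bufB;
    std::size_t ia = 0, ib = 0;
    bool haveA = refill(a, bufA, kChunk);
    bool haveB = refill(b, bufB, kChunk);

    while (haveA && haveB) {
        if (bufA[ia] <= bufB[ib])
            out << bufA[ia++] << '\n';
        else
            out << bufB[ib++] << '\n';
        if (ia == bufA.size()) { haveA = refill(a, bufA, kChunk); ia = 0; }
        if (ib == bufB.size()) { haveB = refill(b, bufB, kChunk); ib = 0; }
    }
    // One input is exhausted; copy the remainder of the other.
    while (haveA) {
        out << bufA[ia++] << '\n';
        if (ia == bufA.size()) { haveA = refill(a, bufA, kChunk); ia = 0; }
    }
    while (haveB) {
        out << bufB[ib++] << '\n';
        if (ib == bufB.size()) { haveB = refill(b, bufB, kChunk); ib = 0; }
    }
}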

Why not utilize the standard library?

#include <fstream>
#include <iterator>
#include <algorithm>

int main()
{
   std::ifstream in1("in1.txt");
   std::ifstream in2("in2.txt");
   std::ofstream ut("ut.txt");
   std::istream_iterator<int> in1_it(in1);  // reads whitespace-separated ints
   std::istream_iterator<int> in2_it(in2);
   std::istream_iterator<int> in_end;       // end-of-stream iterator
   std::ostream_iterator<int> ut_it(ut, "\n");

   // Both inputs are already sorted, so merging the streams is all that's needed.
   std::merge(in1_it, in_end, in2_it, in_end, ut_it);
}

You probably want to read/write in reasonably large chunks to avoid I/O overhead, so use three buffers of roughly 30 MB each: input1, input2 and output.

Keep going until either one of the input buffers is empty or the output buffer is full; then refill the empty input buffer from its file, or flush the full output buffer to disk.

That way you are reading/writing large chunks of data from/to the disk.
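If you go with the std::merge version above, one way to get those large chunks is to give each stream a big buffer before opening the file. This is only a sketch: whether a file stream honors pubsetbuf is implementation-defined, so treat the 30 MB figure and the call itself as assumptions to verify on your platform.

// Sketch: attach ~30 MB buffers to the streams used with std::merge above.
// pubsetbuf behavior on a file stream is implementation-defined; calling it
// before open() gives it the best chance of taking effect.
#include <fstream>
#include <vector>

int main()
{
    std::vector<char> b1(30 << 20), b2(30 << 20), b3(30 << 20);

    std::ifstream in1, in2;
    std::ofstream ut;
    in1.rdbuf()->pubsetbuf(b1.data(), b1.size());
    in2.rdbuf()->pubsetbuf(b2.data(), b2.size());
    ut.rdbuf()->pubsetbuf(b3.data(), b3.size());

    in1.open("in1.txt");
    in2.open("in2.txt");
    ut.open("ut.txt");
    // ... then run std::merge exactly as in the snippet above.
}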

Beyond that you need asynchronous I/O to read/write data while you are doing the sorting. But that's probably overkill.

Since you're only doing a merge, not a complete sort, it's just the basic merge loop. Purely sequential I/O. No need to worry about buffers. Picture a zipper on a jacket. It's that simple. (Note: it could be a lot faster if the numbers are in binary format in the files. Not only will the files be smaller, but the program will be I/O limited, and the numbers will be perfectly accurate.)

#include <stdio.h>

#define BIGBIGNUMBER 1e300   /* sentinel assumed larger than any number in the files */

/* Return the next number from the file, or the sentinel at end of file.
   Checking the result of fscanf (instead of calling feof before reading)
   avoids reading one garbage value past the end. */
double GetNumberFromFile(FILE *file){
  double x;
  if (fscanf(file, "%lf", &x) != 1){
    return BIGBIGNUMBER;
  }
  return x;
}

int main(void){
  FILE *AFILE = fopen("in1.txt", "r");   /* placeholder file names */
  FILE *BFILE = fopen("in2.txt", "r");
  FILE *OUT   = fopen("out.txt", "w");
  if (!AFILE || !BFILE || !OUT){
    perror("fopen");
    return 1;
  }

  double A = GetNumberFromFile(AFILE);
  double B = GetNumberFromFile(BFILE);
  while (A < BIGBIGNUMBER && B < BIGBIGNUMBER){
    if (A < B){
      fprintf(OUT, "%.17g\n", A);
      A = GetNumberFromFile(AFILE);
    }
    else if (B < A){
      fprintf(OUT, "%.17g\n", B);
      B = GetNumberFromFile(BFILE);
    }
    else {
      fprintf(OUT, "%.17g\n", A);
      fprintf(OUT, "%.17g\n", B); /* or not, if you want to eliminate duplicates */
      A = GetNumberFromFile(AFILE);
      B = GetNumberFromFile(BFILE);
    }
  }
  /* One file is exhausted; drain the other. */
  while (A < BIGBIGNUMBER){
    fprintf(OUT, "%.17g\n", A);
    A = GetNumberFromFile(AFILE);
  }
  while (B < BIGBIGNUMBER){
    fprintf(OUT, "%.17g\n", B);
    B = GetNumberFromFile(BFILE);
  }

  fclose(AFILE);
  fclose(BFILE);
  fclose(OUT);
  return 0;
}

Responding to your question, consider a simpler problem: copying one file to another. You're only doing sequential I/O, which the file system is really good at. You write a simple loop to read small units like a byte or an int from one file and write it to the other. As soon as you try to read a byte, the system allocates a nice big buffer, swipes a big chunk of the file into the buffer, and then feeds you the byte out of the buffer. It keeps doing that until you need another buffer, when it invisibly gloms another one for you. The same sort of thing happens with the file you are writing. Now the CPU is pretty quick, so it can iterate through the input bytes, copying them to the output, in a fraction of the time it takes to read or write a buffer, because the reading or writing can't go any faster than the external hardware.

The only reason a larger buffer would help is that part of the reading/writing time is what's called "latency", basically the time it takes to move the head to the desired track and wait for the desired sector to come around. Most file systems break up the files into chunks that are sprinkled around the disk, so the head is jumping anyway. You can hear it.
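For reference, the naive copy loop described above might look like this (file names are placeholders); each get() is served out of the stream's buffer rather than straight off the disk.

// The naive copy loop the paragraph describes: read one byte at a time and
// write it out, letting the stream/OS buffering do the large transfers.
#include <fstream>

int main()
{
    std::ifstream in("src.txt", std::ios::binary);
    std::ofstream out("dst.txt", std::ios::binary);
    char c;
    while (in.get(c))   // each get() is served from a buffer, not the disk
        out.put(c);
}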

The only difference between copying and a merge algorithm like yours is that it reads two files, not one. Either way, the basic time sequence is a series of buffer reads and writes interspersed with a small amount of CPU action. (It is possible to do overlapped I/O, so that the CPU action takes place while the I/O happens, leaving basically no delay between buffer reads and writes, but it was a bigger deal when CPUs were 1000 times slower.)

Of course, if you can arrange it so that the files being read and written are all on separate physical disk drives, and the drives are not fragmented much, then the amount of head motion could be minimized, and larger buffers might help. But basically, with a simple program, you can pretty much expect the simple code to go about as fast as the disk can move data, and giant buffers might help, but not much.

Benchmark it. Read value-by-value versus in large blocks. Feel the difference! =)
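A rough sketch of that comparison, assuming a binary file of doubles named data.bin (a placeholder; adapt it to your format). Both loops still go through the stream's internal buffer, so the gap you measure is mostly per-call overhead.

// Rough benchmark sketch: value-by-value reads vs. large block reads.
// "data.bin" and the 30 MB block size are placeholders.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <vector>

int main()
{
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    {
        // One double per read() call.
        std::ifstream in("data.bin", std::ios::binary);
        double v;
        while (in.read(reinterpret_cast<char*>(&v), sizeof v)) { /* use v */ }
    }
    auto t1 = clock::now();
    {
        // One ~30 MB block per read() call.
        std::ifstream in("data.bin", std::ios::binary);
        std::vector<char> block(30 << 20);
        while (in) {
            in.read(block.data(), block.size());
            if (in.gcount() == 0) break;
            // process in.gcount() bytes of block here
        }
    }
    auto t2 = clock::now();

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("per-value: %lld ms, per-block: %lld ms\n",
                (long long)ms(t1 - t0), (long long)ms(t2 - t1));
}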
