Comparing two files or why this code in Java is faster than C++?

Question

Why is this Java code faster than the C++ version? I need to compare two files byte by byte. For example, comparing two 650 MB files takes 40 seconds in C++ and 10 seconds in Java.

C++ code:

//bufferSize = 8mb
std::ifstream lFile(lFilePath.c_str(), std::ios::in | std::ios::binary);
std::ifstream rFile(rFilePath.c_str(), std::ios::in | std::ios::binary);

std::streamsize lReadBytesCount = 0;
std::streamsize rReadBytesCount = 0;

do {
    lFile.read(p_lBuffer, *bufferSize);
    rFile.read(p_rBuffer, *bufferSize);
    lReadBytesCount = lFile.gcount();
    rReadBytesCount = rFile.gcount();

    if (lReadBytesCount != rReadBytesCount ||
        std::memcmp(p_lBuffer, p_rBuffer, lReadBytesCount) != 0)
    {
        return false;
    }
} while (lFile.good() || rFile.good());

return true;

And Java code:

InputStream is1 = new BufferedInputStream(new FileInputStream(f1)); 
InputStream is2 = new BufferedInputStream(new FileInputStream(f2)); 

byte[] buffer1 = new byte[64];
byte[] buffer2 = new byte[64];

int readBytesCount1 = 0, readBytesCount2 = 0;

while (
    (readBytesCount1 = is1.read(buffer1)) != -1 &&
    (readBytesCount2 = is2.read(buffer2)) != -1
) {             
    if (Arrays.equals(buffer1, buffer2) && readBytesCount1 == readBytesCount2)
        countItr++;
    else {
        result = false;
        break;
    }
}

Answer 1

One possible answer is that the C++ code uses an 8 MB buffer, while the Java version uses 64 bytes. What happens if the difference is within the first few bytes? Then the Java version only needs to read 64 bytes to find the difference, while the C++ version needs to read 8 MB. If you wish to compare them, you should really use the same buffer size.

Furthermore, if the files are identical, there can be other reasons for the difference. Consider the time it takes to allocate 8 MB of data (this could even span multiple pages) versus the time it takes to allocate just 64 bytes. Since both programs read sequentially, the extra cost would be on the memory side rather than in the disk access.
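To make the buffer size point easier to test, here is a minimal sketch of a comparison routine that takes the buffer size as a parameter, so the same value can be used on both sides of a benchmark. The name `files_equal` and its parameters are my own assumptions, not something from the question.

```cpp
#include <cassert>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): compares two files using a
// caller-chosen buffer size.
bool files_equal(const std::string &left_path, const std::string &right_path,
                 std::size_t buffer_size)
{
    assert(buffer_size > 0);

    std::ifstream left(left_path, std::ios::in | std::ios::binary);
    std::ifstream right(right_path, std::ios::in | std::ios::binary);
    if (!left || !right)
        return false;  // assumption: an unreadable file counts as "not equal"

    std::vector<char> left_buf(buffer_size);
    std::vector<char> right_buf(buffer_size);

    do {
        left.read(left_buf.data(), static_cast<std::streamsize>(buffer_size));
        right.read(right_buf.data(), static_cast<std::streamsize>(buffer_size));
        const std::streamsize left_count = left.gcount();
        const std::streamsize right_count = right.gcount();

        // Returning on the first mismatch means at most one block per file
        // is read beyond the point where the files diverge.
        if (left_count != right_count ||
            std::memcmp(left_buf.data(), right_buf.data(),
                        static_cast<std::size_t>(left_count)) != 0)
            return false;
    } while (left && right);

    return true;
}
```

Calling it once with 64 and once with 8 * 1024 * 1024 on the same pair of files should show how much of the measured difference comes from the buffer size alone.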

Answer 2

I just took your Java program and wrote an equivalent C++ program, and both take nearly the same time to compare two identical files, give or take a second.

One possible, trivial explanation is that you ran the C++ program first and then the Java program. If this was your only test, the difference in execution time can be explained by caching alone, although 40 seconds is a long time for reading 650 MB on today's hardware.

After the first run, the data blocks are in the system file cache, so the second run needs no disk access to retrieve the files. To get comparable results, run the tests multiple times with both the C++ and the Java program.
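As a rough illustration of running the test multiple times, here is a small timing sketch. The helper name `time_runs` is hypothetical; it simply calls whatever comparison you pass in several times and records each duration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: runs `work` `runs` times and returns how long each
// run took, in milliseconds.
std::vector<double> time_runs(const std::function<void()> &work, int runs)
{
    assert(runs > 0);

    std::vector<double> millis;
    millis.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        millis.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return millis;
}
```

If caching is the explanation, the first entry should be noticeably larger than the rest, because only the first run has to go to the disk. The same pattern applies on the Java side with System.nanoTime().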

In your code, you have

lFile.read(p_lBuffer, *bufferSize);

which contradicts your comment at the beginning

//bufferSize = 8mb

so unless you show real complete code, anyone's guess is valid.

To eat my own dog food, here is my C++ version:

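#include <cstring>
#include <fstream>
#include <iostream>

const std::size_t N = 8 * 1024 * 1024;
char buf1[N], buf2[N];

int main(int argc, char **argv)
{
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " file1 file2\n";
        return 2;
    }

    std::ios_base::sync_with_stdio(false);
    // Open in binary mode so no newline translation happens on any platform.
    std::ifstream f1(argv[1], std::ios::in | std::ios::binary);
    std::ifstream f2(argv[2], std::ios::in | std::ios::binary);
    if (!f1 || !f2)
        return 2;

    // read() sets failbit on a short final block, so compare gcount() bytes
    // first and test the stream state afterwards; otherwise the last partial
    // block would never be compared.
    do {
        f1.read(buf1, sizeof(buf1));
        f2.read(buf2, sizeof(buf2));
        std::streamsize n1 = f1.gcount(), n2 = f2.gcount();
        if (n1 != n2 || std::memcmp(buf1, buf2, static_cast<std::size_t>(n1)) != 0)
            return 1;
    } while (f1 && f2);

    return 0;
}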

Answer 3

While the buffer size answer is a really good one, and likely very important, another possible source of your problem is the iostream library. I would generally not use that library for this sort of work. One issue it could cause, for example, is extra copying because of the buffering that iostream does for you. I would use the raw read and write system calls instead.

For example, on a Linux C++11 platform I would do this:

#include <array>
#include <algorithm>
#include <string>
#include <stdexcept>

// Needed for open and close on a Linux platform
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

using ::std::string;

bool same_contents(const string &fname1, const string &fname2)
{
   int fd1 = ::open(fname1.c_str(), O_RDONLY);
   if (fd1 < 0) {
      throw ::std::runtime_error("Open of " + fname1 + " failed.");
   }
   int fd2 = ::open(fname2.c_str(), O_RDONLY);
   if (fd2 < 0) {
      ::close(fd1);
      fd1 = -1;
      throw ::std::runtime_error("Open of " + fname2 + " failed.");
   }

   bool same = true;
   try {
      ::std::array<char, 4096> buf1;
      ::std::array<char, 4096> buf2;
      bool done = false;

      while (!done) {
         // read() returns ssize_t, not int
         ssize_t read1 = ::read(fd1, buf1.data(), buf1.size());
         if (read1 < 0) {
            throw ::std::runtime_error("Error reading " + fname1);
         }
         ssize_t read2 = ::read(fd2, buf2.data(), buf2.size());
         if (read2 < 0) {
            throw ::std::runtime_error("Error reading " + fname2);
         }
         if (read1 != read2) {
            same = false;
            done = true;
         }
         if (same && read1 > 0) {
            const auto compare_result = ::std::mismatch(buf1.begin(),
                                                        buf1.begin() + read1,
                                                        buf2.begin());
            if (compare_result.first != (buf1.begin() + read1)) {
               same = false;
            }
         }
         if (!same || (read1 < static_cast<ssize_t>(buf1.size()))) {
            done = true;
         }
      }
   } catch (...) {
      if (fd1 >= 0) ::close(fd1);
      if (fd2 >= 0) ::close(fd2);
      throw;
   }
   if (fd1 >= 0) ::close(fd1);
   if (fd2 >= 0) ::close(fd2);
   return same;
}
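
One caveat about the function above: it treats any read that returns fewer bytes than requested as the end of the file. That normally holds for regular files on Linux, but read() is allowed to return a short count before end of file (for pipes, or after a signal), so the two files could go out of step. A possible way around this, sketched under the assumption of a POSIX platform, is a small helper that keeps reading until the buffer is full or end of file is reached. The name `read_full` is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>

#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: keeps calling read() until `size` bytes have arrived,
// end of file is reached, or an error occurs. Returns the number of bytes
// stored in `buf`, or -1 on error (errno is left as set by read()).
ssize_t read_full(int fd, char *buf, std::size_t size)
{
    assert(buf != nullptr);

    std::size_t total = 0;
    while (total < size) {
        const ssize_t got = ::read(fd, buf + total, size - total);
        if (got < 0) {
            if (errno == EINTR)
                continue;  // interrupted by a signal: retry
            return -1;
        }
        if (got == 0)
            break;  // end of file
        total += static_cast<std::size_t>(got);
    }
    return static_cast<ssize_t>(total);
}
```

With such a helper in place of the direct ::read calls, a count smaller than the buffer size really does mean end of file, so the existing termination test stays valid.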
