
Why does this C++ code execute so slowly compared to Java?

I recently wrote a computation-intensive algorithm in Java and then translated it to C++. To my surprise, the C++ version ran considerably slower. I have now written a much shorter Java test program and a corresponding C++ program (see below). My original code featured a lot of array access, as does the test code. The C++ version takes 5.5 times longer to execute (see the comment at the end of each program).

Conclusions after the first 21 comments below ...

Test code:

  1. g++ -o ... Java 5.5 times faster
  2. g++ -O3 -o ... Java 2.9 times faster
  3. g++ -fprofile-generate -march=native -O3 -o ... (run, then g++ -fprofile-use etc) Java 1.07 times faster.

My original project (much more complex than the test code), with the same three compiler settings:

  1. Java 1.8 times faster
  2. C++ 1.9 times faster
  3. C++ 2 times faster

Software environment:
    Ubuntu 16.04 (64 bit).
    NetBeans 8.2 / JDK 8u121 (Java code executed inside NetBeans)
    g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
    Compilation: g++ -o cpp_test cpp_test.cpp

Java code:

public class JavaTest {
    public static void main(String[] args) {
        final int ARRAY_LENGTH = 100;
        final int FINISH_TRIGGER = 100000000;
        int[] intArray = new int[ARRAY_LENGTH];
        for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
        int i = 0;
        boolean finished = false;
        long loopCount = 0;
        System.out.println("Start");
        long startTime = System.nanoTime();
        while (!finished) {
            loopCount++;
            intArray[i]++;
            if (intArray[i] >= FINISH_TRIGGER) finished = true;
            else if (i <(ARRAY_LENGTH - 1)) i++;
            else i = 0;
        }
        System.out.println("Finish: " + loopCount + " loops; " +
            ((System.nanoTime() - startTime)/1e9) + " secs");
        // 5 executions in range 5.98 - 6.17 secs (each 9999999801 loops)
    }
}

C++ code:

//cpp_test.cpp:
#include <iostream>
#include <sys/time.h>
#include <ctime>     // clock_gettime() and timespec are declared in <time.h> / <ctime>
int main() {
    const int ARRAY_LENGTH = 100;
    const int FINISH_TRIGGER = 100000000;
    int *intArray = new int[ARRAY_LENGTH];
    for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
    int i = 0;
    bool finished = false;
    long long loopCount = 0;
    std::cout << "Start\n";
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    long long startTime = (1000000000*ts.tv_sec) + ts.tv_nsec;
    while (!finished) {
        loopCount++;
        intArray[i]++;
        if (intArray[i] >= FINISH_TRIGGER) finished = true;
        else if (i < (ARRAY_LENGTH - 1)) i++;
        else i = 0;
    }
    clock_gettime(CLOCK_REALTIME, &ts);
    double elapsedTime =
        ((1000000000*ts.tv_sec) + ts.tv_nsec - startTime)/1e9;
    std::cout << "Finish: " << loopCount << " loops; ";
    std::cout << elapsedTime << " secs\n";
    // 5 executions in range 33.07 - 33.45 secs (each 9999999801 loops)
}

The only time I could get the C++ program to outperform Java was when using profiling information. This shows that there's something in the runtime information (that Java gets by default) that allows for faster execution.

There's not much going on in your program apart from a non-trivial if statement, and without analysing the entire program it's hard to predict which branch is most likely to be taken. This leads me to believe that this is a branch-misprediction issue. Modern CPUs pipeline instructions, which allows for higher throughput, but this requires a prediction of which instructions will execute next. If the guess is wrong, the instruction pipeline must be flushed and the correct instructions loaded in, which takes time.

At compile time, the compiler doesn't have enough information to predict which branch is most likely. CPUs do some branch prediction of their own, but the static heuristics amount to roughly "loops keep looping" and "an if takes the if branch rather than the else".

Java, however, has the advantage of being able to use information at runtime as well as compile time. This allows Java to identify the middle branch as the one that occurs most frequently and so have this branch predicted for the pipeline.
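A side note that is not from the original answers: GCC and clang provide the (non-standard) __builtin_expect extension, which lets you hand the compiler a static likelihood hint yourself. The sketch below is a hypothetical variant of cpp_test.cpp with the timing code stripped out (time it externally, e.g. with the shell's time command); the hint mainly influences code layout rather than the hardware branch predictor, so it may or may not narrow the gap on a given machine.

    //branch_hint_test.cpp (hypothetical variant of cpp_test.cpp, timing removed)
    #include <iostream>
    int main() {
        const int ARRAY_LENGTH = 100;
        const int FINISH_TRIGGER = 100000000;
        int *intArray = new int[ARRAY_LENGTH];
        for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
        int i = 0;
        bool finished = false;
        long long loopCount = 0;
        while (!finished) {
            loopCount++;
            intArray[i]++;
            // Hint: hitting the trigger is rare; advancing i is the common case.
            if (__builtin_expect(intArray[i] >= FINISH_TRIGGER, 0)) finished = true;
            else if (__builtin_expect(i < (ARRAY_LENGTH - 1), 1)) i++;
            else i = 0;
        }
        std::cout << "Finish: " << loopCount << " loops\n";
        delete[] intArray;
    }

Profile-guided optimization (the -fprofile-generate / -fprofile-use builds listed above) is the more systematic route to the same effect, since the compiler measures the actual branch frequencies instead of trusting a hand-written guess.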

Somehow both GCC and clang fail to unroll this loop and hoist the invariants out of it, even at -O3 and -Os, but Java does.

Java's final JITted assembly code is similar to this (in reality it is repeated twice):

    while (true) {
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) { if (i >= ARRAY_LENGTH) i = 0; break; }
        if (i >= ARRAY_LENGTH) i = 0;
    }

With this loop I'm getting exact same timings (6.4s) between C++ and Java.

Why is this transformation legal? Because ARRAY_LENGTH is 100, which is a multiple of 4, so i can only reach ARRAY_LENGTH (and need to be reset to 0) on every 4th iteration.

This looks like an opportunity for improvement for GCC and clang; they fail to unroll loops whose total iteration count is unknown, and even if unrolling is forced, they fail to recognize the parts of the loop that apply only to certain iterations.

Regarding your findings in more complex code (a.k.a. real life): Java's optimizer is exceptionally good at small loops (a lot of thought has been put into that), but Java loses a lot of time on virtual calls and GC.

In the end it comes down to machine instructions running on a concrete architecture; whoever comes up with the best set wins. Don't assume the compiler will "do the right thing"; look at the generated code, profile, and repeat.

For example, if you restructure your loop just a bit:

    while (!finished) {
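        // the wrap-around of i is now handled by re-entering a counted for loop
        // instead of by the data-dependent else branch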
        for (i=0; i<ARRAY_LENGTH; ++i) {
            loopCount++;
            if (++intArray[i] >= FINISH_TRIGGER) {
                finished=true;
                break;
            }
        }
    }

Then C++ will outperform Java (5.9s vs 6.4s). (See the revised C++ assembly.)

And if you can allow a slight overrun (increment more intArray elements after reaching the exit condition):

    while (!finished) {
        for (int i=0; i<ARRAY_LENGTH; ++i) {
            ++intArray[i];
        }
        loopCount+=ARRAY_LENGTH;
        for (int i=0; i<ARRAY_LENGTH; ++i) {
            if (intArray[i] >= FINISH_TRIGGER) {
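                // the original loop would have stopped at element i, so
                // uncount the ARRAY_LENGTH-i-1 increments made past it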
                loopCount-=ARRAY_LENGTH-i-1;
                finished=true;
                break;
            }
        }
    }

Now clang is able to vectorize the loop and runs in 3.5s vs. Java's 4.8s (GCC, unfortunately, is still not able to vectorize it).
