
Java vs C++ (g++) vs C++ (Visual Studio) performance

EDIT: Following the first answer, I removed the myexp() function, since it contained a bug and is not the main point of the discussion.

I have one simple piece of code that I compiled for different platforms, and I get different performance results (execution times):

  • Java 8 / Linux: 3.5 seconds

    Execution Command: java -server Test

  • C++ / gcc 4.8.3: 6.22 seconds

    Compilation options: O3

  • C++ / Visual Studio 2015: 1.7 seconds

    Compiler Options: /Og /Ob2 /Oi

It seems that Visual Studio has additional optimization options that are not available for the g++ compiler.

My question is: why is Visual Studio (with those compiler options) so much faster than both Java and the g++-compiled C++ (with -O3 optimization, which I believe is the most aggressive level)?

Below you can find both Java and C++ code.

C++ Code:

#include <cstdio>
#include <ctime>
#include <cstdlib>
#include <cmath>


static unsigned int g_seed;

//Used to seed the generator.
inline void fast_srand( int seed )
{
    g_seed = seed;
}

//fastrand routine returns one integer, similar output value range as C lib.
inline int fastrand()
{
    g_seed = ( 214013 * g_seed + 2531011 );
    return ( g_seed >> 16 ) & 0x7FFF;
}

int main()
{
    static const int NUM_RESULTS = 10000;
    static const int NUM_INPUTS  = 10000;

    double dInput[NUM_INPUTS];
    double dRes[NUM_RESULTS];

    fast_srand(10);

    clock_t begin = clock();

    for ( int i = 0; i < NUM_RESULTS; i++ )
    {
        dRes[i] = 0;

        for ( int j = 0; j < NUM_INPUTS; j++ )
        {
           dInput[j] = fastrand() * 1000;
           dInput[j] = log10( dInput[j] );
           dRes[i] += dInput[j];
        }
     }


    clock_t end = clock();

    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;

    printf( "Total execution time: %f sec - %f\n", elapsed_secs, dRes[0]);

    return 0;
}
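
For reference, the corresponding compile commands are roughly the following (the exact command lines may differ, but these match the options listed above):

g++ -O3 test.cpp -o test
cl /Og /Ob2 /Oi test.cpp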

Java Code:

import java.util.concurrent.TimeUnit;


public class Test
{

    static int g_seed;

    static void fast_srand( int seed )
    {
        g_seed = seed;
    }

    //fastrand routine returns one integer, similar output value range as C lib.
    static int fastrand()
    {
        g_seed = ( 214013 * g_seed + 2531011 );
        return ( g_seed >> 16 ) & 0x7FFF;
    }


    public static void main(String[] args)
    {
        final int NUM_RESULTS = 10000;
        final int NUM_INPUTS  = 10000;


        double[] dRes = new double[NUM_RESULTS];
        double[] dInput = new double[NUM_INPUTS];


        fast_srand(10);

        long nStartTime = System.nanoTime();

        for ( int i = 0; i < NUM_RESULTS; i++ )
        {
            dRes[i] = 0;

            for ( int j = 0; j < NUM_INPUTS; j++ )
            {
               dInput[j] = fastrand() * 1000;
               dInput[j] = Math.log( dInput[j] );
               dRes[i] += dInput[j];
            }
        }

        long nDifference = System.nanoTime() - nStartTime;

        System.out.printf( "Total execution time: %f sec - %f\n", TimeUnit.NANOSECONDS.toMillis(nDifference) / 1000.0, dRes[0]);
    }
}

The function

static inline double myexp( double val )
{
    const long tmp = (long)( 1512775 * val + 1072632447 );
    return double( tmp << 32 );
}

gives the warning in MSVC

warning C4293: '<<' : shift count negative or too big, undefined behavior

After changing to:

static inline double myexp(double val)
{
    const long long tmp = (long long)(1512775 * val + 1072632447);
    return double(tmp << 32);
}

the code also takes around 4 secs in MSVC.

So, apparently MSVC optimized a whole lot of stuff away, possibly the entire myexp() function (and maybe even other code depending on its result as well) - because it can (remember, undefined behavior).

The lesson to take away: check (and fix) the warnings as well.
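
As an aside, the trick myexp() appears to be based on is Schraudolph's fast exp() approximation, where the shifted 64-bit integer is meant to be reinterpreted as the bit pattern of the resulting double, not converted to its numeric value. A minimal sketch of that reading (my assumption about the intent, not code from the question):

#include <cstring>
#include <cmath>
#include <cstdio>

// Presumed intent of myexp(): a Schraudolph-style exp() approximation.
// The 64-bit integer is reinterpreted as an IEEE-754 double, not converted.
static inline double myexp_bits( double val )
{
    const long long tmp = (long long)( 1512775 * val + 1072632447 ) << 32;
    double result;
    memcpy( &result, &tmp, sizeof( result ) );  // safe type punning
    return result;
}

int main()
{
    // Rough sanity check against the real exp(); expect errors of a few percent.
    for ( double x = -2.0; x <= 2.0; x += 1.0 )
        printf( "x=%+.1f  myexp=%f  exp=%f\n", x, myexp_bits( x ), exp( x ) );
    return 0;
}

This version produces values close to exp() and avoids the undefined shift, but it is only an approximation and does not change the timing point above.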


Note that if I try to print the result inside the function, the MSVC-optimized version gives me (for every call):

tmp: -2147483648
result: 0.000000

I.e., MSVC optimized the undefined behavior into a constant 0 return. It might also be interesting to look at the assembly output to see what else has been optimized away because of this.


So, after checking the assembly, the fixed version has this code:

; 52   :             dInput[j] = myexp(dInput[j]);
; 53   :             dInput[j] = log10(dInput[j]);

    mov eax, esi
    shr eax, 16                 ; 00000010H
    and eax, 32767              ; 00007fffH
    imul    eax, eax, 1000
    movd    xmm0, eax
    cvtdq2pd xmm0, xmm0
    mulsd   xmm0, QWORD PTR __real@4137154700000000
    addsd   xmm0, QWORD PTR __real@41cff7893f800000
    call    __dtol3
    mov edx, eax
    xor ecx, ecx
    call    __ltod3
    call    __libm_sse2_log10_precise

; 54   :             dRes[i] += dInput[j];

In the original version, this entire block is missing, i.e. the call to log10() has apparently been optimized out as well and replaced by a constant at the end (apparently -INF, which is the result of log10(0.0); in fact, the result might also be undefined or implementation-defined). Also, the entire myexp() function was replaced by an fldz instruction (basically, "load zero"). So that explains the extra speed :)
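
A tiny snippet to confirm the -INF part (just an illustrative check, not from the original post); with IEEE-754 doubles, log10( 0.0 ) yields negative infinity, which the compiler is free to fold into a constant:

#include <cmath>
#include <cstdio>

int main()
{
    // log10 of zero is a pole error: the result is negative infinity.
    const double r = log10( 0.0 );
    printf( "log10(0.0) = %f\n", r );  // typically prints -inf
    return 0;
}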


EDIT

Regarding the performance difference when using the real exp(): the assembly output might give some clues.

In particular, for MSVC you can use these additional parameters:

/FAs /Qvec-report:2

/FAs produces the assembly listing (along with the source code)

/Qvec-report:2 provides useful information about the vectorization status:

test.cpp(49) : info C5002: loop not vectorized due to reason '1304'
test.cpp(45) : info C5002: loop not vectorized due to reason '1106'

The reason codes are documented here: https://msdn.microsoft.com/en-us/library/jj658585.aspx - in particular, MSVC does not seem to be able to vectorize these loops properly. According to the assembly listing, however, it still uses SSE2 instructions (scalar SSE2 rather than true vectorization, but it still improves the speed significantly).
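
For reference, a full compile line with these diagnostics enabled might look like this (assuming the optimization options from the question):

cl /Og /Ob2 /Oi /FAs /Qvec-report:2 test.cpp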

The equivalent parameters for GCC are:

-funroll-loops -ftree-vectorizer-verbose=1
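
(For example: g++ -O3 -funroll-loops -ftree-vectorizer-verbose=1 test.cpp - again assuming the -O3 build from the question.)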

Which gives the result for me:

Analyzing loop at test.cpp:42
Analyzing loop at test.cpp:46
test.cpp:30: note: vectorized 0 loops in function.
test.cpp:46: note: Unroll loop 3 times

So apparently g++ is not able to vectorize these loops either, but it does loop unrolling (in the assembly I can see that the loop code is duplicated 3 times), which can also explain the better performance.

Unfortunately, this is where Java falls short, AFAIK: it does not apply vectorization, SSE2 code generation, or loop unrolling here, and it is therefore much slower than the optimized C++ version. See e.g. here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions? - where JNI is recommended for better performance (i.e., doing the calculation in a C/C++ DLL called from the Java app through the JNI interface).
