64-bit code generated by GCC is 3 times slower than 32-bit

Question

I've noticed that my code runs much slower on 64-bit Linux than on 32-bit Linux, 64-bit Windows, or 64-bit Mac. This is a minimal test case.

#include <stdlib.h>

typedef unsigned char UINT8;

void
stretch(UINT8 * lineOut, UINT8 * lineIn, int xsize, float *kk)
{
    int xx, x;

    for (xx = 0; xx < xsize; xx++) {
        float ss = 0.0;
        for (x = 0; x < xsize; x++) {
            ss += lineIn[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}

int
main( int argc, char** argv )
{
    int i;
    int xsize = 2048;

    UINT8 *lineIn = calloc(xsize, sizeof(UINT8));
    UINT8 *lineOut = calloc(xsize, sizeof(UINT8));
    float *kk = calloc(xsize, sizeof(float));

    for (i = 0; i < 1024; i++) {
        stretch(lineOut, lineIn, xsize, kk);
    }

    return 0;
}

And here is how it runs:

$ cc --version
cc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
$ cc -O2 -Wall -m64 ./tt.c -o ./tt && time ./tt
user  14.166s
$ cc -O2 -Wall -m32 ./tt.c -o ./tt && time ./tt
user  5.018s

As you can see, the 32-bit version runs almost 3 times faster (I've tested on both 32-bit and 64-bit Ubuntu; the result is the same). Even stranger, the performance depends on the C standard:

$ cc -O2 -Wall -std=c99 -m32 ./tt.c -o ./tt && time ./tt
user  15.825s
$ cc -O2 -Wall -std=gnu99 -m32 ./tt.c -o ./tt && time ./tt
user  5.090s

How can this be? How can I work around this to speed up the 64-bit version generated by GCC?

Update 1

I've compared the assembly produced by the fast 32-bit builds (default and gnu99) with the slow one (c99) and found the following in the slow case:

.L5:
  movzbl    (%ebx,%eax), %edx   # MEM[base: lineIn_10(D), index: _72, offset: 0B], D.1543
  movl  %edx, (%esp)    # D.1543,
  fildl (%esp)  #
  fmuls (%esi,%eax,4)   # MEM[base: kk_18(D), index: _72, step: 4, offset: 0B]
  addl  $1, %eax    #, x
  cmpl  %ecx, %eax  # xsize, x
  faddp %st, %st(1) #,
  fstps 12(%esp)    #
  flds  12(%esp)    #
  jne   .L5 #,

There are no fstps and flds instructions in the fast case. So in the slow case GCC stores the value to memory and loads it back on each iteration. I've tried the register float type, but this doesn't help.

Update 2

I've tested on gcc-4.9 and it looks like it generates optimal code for 64-bit. And -ffast-math (suggested by @jch) fixes -m32 -std=c99 for both GCC versions. I'm still looking for a solution for 64-bit on gcc-4.8, because for now it is a more common version than 4.9.

Answer 1

There is a partial dependency stall in the code generated by older versions of GCC.

movzbl (%rsi,%rax), %r8d
cvtsi2ss %r8d, %xmm0  ;; all upper bits in %xmm0 are false dependency

The dependency can be broken by xorps.

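#ifdef __SSE__
/* Zero the destination register with xorps before cvtsi2ss, so the
   conversion does not depend on the register's previous contents. */
static inline float __attribute__((always_inline)) i2f(int v) {
    float x;
    __asm__("xorps %0, %0; cvtsi2ss %1, %0" : "=x"(x) : "r"(v) );
    return x;
}
#else
/* Without SSE, fall back to a plain cast. */
static inline float __attribute__((always_inline)) i2f(int v) { return (float) v; }
#endif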

void stretch(UINT8* lineOut, UINT8* lineIn, int xsize, float *kk)
{
    int xx, x;

    for (xx = 0; xx < xsize; xx++) {
        float ss = 0.0;
        for (x = 0; x < xsize; x++) {
            ss += i2f(lineIn[x]) * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}

Results

$ cc -O2 -Wall -m64 ./test.c -o ./test64 && time ./test64
./test64  4.07s user 0.00s system 99% cpu 4.070 total
$ cc -O2 -Wall -m32 ./test.c -o ./test32 && time ./test32
./test32  3.94s user 0.00s system 99% cpu 3.938 total
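The same idea can also be sketched with SSE intrinsics instead of inline assembly. This is only an illustration, not part of the original answer: the names i2f_intrin and dot_line are hypothetical, and whether the compiler actually emits xorps before cvtsi2ss for this code is not guaranteed, so the generated assembly should be checked (for example with -S).

```c
#ifdef __SSE__
#include <xmmintrin.h>
#endif

typedef unsigned char UINT8;

/* Convert int to float. On SSE targets the conversion is written
   against an explicitly zeroed vector, so it does not depend on
   whatever was previously in the destination register. */
static inline float i2f_intrin(int v)
{
#ifdef __SSE__
    return _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), v));
#else
    return (float) v;   /* fallback for targets without SSE */
#endif
}

/* Hypothetical helper: the inner loop of stretch() as a function. */
static float dot_line(const UINT8 *lineIn, const float *kk, int xsize)
{
    float ss = 0.0f;
    int x;

    for (x = 0; x < xsize; x++) {
        ss += i2f_intrin(lineIn[x]) * kk[x];
    }
    return ss;
}
```

Compared with the inline assembly version, this leaves register allocation and instruction scheduling to the compiler, at the cost of less control over the exact instruction sequence.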

Answer 2

Here is what I tried: I declared ss as volatile. This prevented the compiler from optimizing it. I got similar times for both the 32-bit and 64-bit versions.

The 64-bit build was slightly slower, but this is normal because 64-bit code is larger and the uCode cache has a finite size. So in general 64-bit should be very slightly slower than 32-bit (<3-4%).

Getting back to the problem, I think that in 32-bit mode the compiler makes more aggressive optimizations on ss.
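For reference, the volatile experiment described above might look like the following sketch. The function name stretch_volatile is made up here; the only change from the question's stretch is the volatile qualifier on ss. It forces every read and write of ss through memory, so it is a diagnostic that equalizes the two builds rather than a way to make the 64-bit one faster.

```c
typedef unsigned char UINT8;

void
stretch_volatile(UINT8 *lineOut, UINT8 *lineIn, int xsize, float *kk)
{
    int xx, x;

    for (xx = 0; xx < xsize; xx++) {
        /* volatile: the accumulator must be stored and reloaded on
           every iteration, so it cannot be kept in a register. */
        volatile float ss = 0.0f;
        for (x = 0; x < xsize; x++) {
            ss += lineIn[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}
```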

Update 1:

Looking at the 64-bit code, it generates a CVTTSS2SI instruction, paired with a CVTSI2SS instruction, for the conversions between float and integer. This has a higher latency. The 32-bit code just uses an FMULS instruction, operating directly on floats. I need to look for a compiler option to prevent these conversions.
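Short of a compiler option, one source-level way to take the integer-to-float conversion out of the hot loop is to convert the input line once, before the nested loops. This is a different technique from anything proposed in the thread and it is untimed here; the name stretch_preconverted, the temporary buffer lineF, and the int return value are all assumptions of this sketch.

```c
#include <stdlib.h>

typedef unsigned char UINT8;

/* Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int
stretch_preconverted(UINT8 *lineOut, const UINT8 *lineIn, int xsize,
                     const float *kk)
{
    int xx, x;
    float *lineF;

    if (xsize <= 0) {
        return 0;   /* nothing to do */
    }

    lineF = malloc((size_t) xsize * sizeof *lineF);
    if (lineF == NULL) {
        return -1;
    }

    /* Convert each input byte to float exactly once. */
    for (x = 0; x < xsize; x++) {
        lineF[x] = (float) lineIn[x];
    }

    /* The inner loop now multiplies floats only. */
    for (xx = 0; xx < xsize; xx++) {
        float ss = 0.0f;
        for (x = 0; x < xsize; x++) {
            ss += lineF[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }

    free(lineF);
    return 0;
}
```

Every unsigned char value is exactly representable as a float, so the conversion itself does not change the values being summed. The trade-off is one extra allocation and pass over the input per call.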

Answer 3

In 32-bit mode, the compiler is making extra efforts to preserve strict IEEE 754 floating point semantics. You can avoid this by compiling with -ffast-math:

$ gcc -m32 -O2 -std=c99 test.c && time ./a.out 

real    0m13.869s
user    0m13.884s
sys     0m0.000s
$ gcc -m32 -O2 -std=c99 -ffast-math test.c && time ./a.out 

real    0m4.477s
user    0m4.480s
sys     0m0.000s

I cannot reproduce your results in 64-bit mode, but I'm pretty confident that -ffast-math will solve your issues. More generally, unless you really need reproducible IEEE 754 rounding behaviour, -ffast-math is what you want.
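To see how a given build evaluates float expressions, the standard FLT_EVAL_METHOD macro from <float.h> can be inspected. The sketch below is an addition for illustration, with hypothetical helper names. As far as I know, GCC in a strict ISO mode such as -std=c99 defaults to -fexcess-precision=standard, which requires rounding to float on each assignment when x87 arithmetic is used; that would be consistent with the fstps/flds pair shown in the question, but treat it as a likely explanation rather than a verified one.

```c
#include <float.h>

/* Map a FLT_EVAL_METHOD value to a short description. */
const char *
eval_method_name(int method)
{
    switch (method) {
    case 0:  return "float ops evaluated as float";
    case 1:  return "float ops evaluated as double";
    case 2:  return "float ops evaluated as long double";
    default: return "indeterminate or implementation-defined";
    }
}

/* Description of the evaluation method of the current build. */
const char *
current_eval_method(void)
{
    return eval_method_name(FLT_EVAL_METHOD);
}
```

Printing current_eval_method() from builds made with -m32 and -m64 shows whether the two targets evaluate float arithmetic at different precisions. Typically x87 code reports 2 and SSE code reports 0, though this depends on the compiler version and flags.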

Answer 4

Looks like a case for restrict. The three arrays can't overlap, can they?
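A sketch of what that suggestion could look like, assuming the three buffers really never overlap. The name stretch_restrict is hypothetical, and whether the qualifiers change the code GCC 4.8 generates here has not been measured. Note that restrict is a C99 keyword, so under the older gnu89 default it would have to be spelled __restrict.

```c
typedef unsigned char UINT8;

/* restrict promises the compiler that lineOut, lineIn and kk do not
   alias each other; calling this with overlapping buffers is
   undefined behaviour. */
void
stretch_restrict(UINT8 *restrict lineOut, const UINT8 *restrict lineIn,
                 int xsize, const float *restrict kk)
{
    int xx, x;

    for (xx = 0; xx < xsize; xx++) {
        float ss = 0.0f;
        for (x = 0; x < xsize; x++) {
            ss += lineIn[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}
```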

4 answers

Answer 1: 8, accepted, 2014-10-27 12:56:01
Answer 2: 2, 2014-10-27 11:36:15
Answer 3: 2, 2014-10-27 12:26:45
Answer 4: 1, 2014-10-27 11:18:15
