
Does gcc give worse/slower code using v* assembly instructions?

Consider this simple loop:

float f(float x[]) {
  float p = 1.0;
  for (int i = 0; i < 128; i++)
    p += x[i];
  return p;
}

If you compile it with -O2 -march=haswell in gcc you get:

    f:
            vmovss  xmm0, DWORD PTR .LC0[rip]
            lea     rax, [rdi+512]
    .L2:
            vaddss  xmm0, xmm0, DWORD PTR [rdi]
            add     rdi, 4
            cmp     rdi, rax
            jne     .L2
            ret
    .LC0:
            .long   1065353216

However, the Intel C Compiler gives:

f:
        xor       eax, eax                                      #3.3
        pxor      xmm0, xmm0                                    #2.11
        movaps    xmm7, xmm0                                    #2.11
        movaps    xmm6, xmm0                                    #2.11
        movaps    xmm5, xmm0                                    #2.11
        movaps    xmm4, xmm0                                    #2.11
        movaps    xmm3, xmm0                                    #2.11
        movaps    xmm2, xmm0                                    #2.11
        movaps    xmm1, xmm0                                    #2.11
..B1.2:                         # Preds ..B1.2 ..B1.1
        movups    xmm8, XMMWORD PTR [rdi+rax*4]                 #4.10
        movups    xmm9, XMMWORD PTR [16+rdi+rax*4]              #4.10
        movups    xmm10, XMMWORD PTR [32+rdi+rax*4]             #4.10
        movups    xmm11, XMMWORD PTR [48+rdi+rax*4]             #4.10
        movups    xmm12, XMMWORD PTR [64+rdi+rax*4]             #4.10
        movups    xmm13, XMMWORD PTR [80+rdi+rax*4]             #4.10
        movups    xmm14, XMMWORD PTR [96+rdi+rax*4]             #4.10
        movups    xmm15, XMMWORD PTR [112+rdi+rax*4]            #4.10
        addps     xmm0, xmm8                                    #4.5
        addps     xmm7, xmm9                                    #4.5
        addps     xmm6, xmm10                                   #4.5
        addps     xmm5, xmm11                                   #4.5
        addps     xmm4, xmm12                                   #4.5
        addps     xmm3, xmm13                                   #4.5
        addps     xmm2, xmm14                                   #4.5
        addps     xmm1, xmm15                                   #4.5
        add       rax, 32                                       #3.3
        cmp       rax, 128                                      #3.3
        jb        ..B1.2        # Prob 99%                      #3.3
        addps     xmm0, xmm7                                    #2.11
        addps     xmm6, xmm5                                    #2.11
        addps     xmm4, xmm3                                    #2.11
        addps     xmm2, xmm1                                    #2.11
        addps     xmm0, xmm6                                    #2.11
        addps     xmm4, xmm2                                    #2.11
        addps     xmm0, xmm4                                    #2.11
        movaps    xmm1, xmm0                                    #2.11
        movhlps   xmm1, xmm0                                    #2.11
        addps     xmm0, xmm1                                    #2.11
        movaps    xmm2, xmm0                                    #2.11
        shufps    xmm2, xmm0, 245                               #2.11
        addss     xmm0, xmm2                                    #2.11
        addss     xmm0, DWORD PTR .L_2il0floatpacket.0[rip]     #2.11
        ret                                                     #5.10
.L_2il0floatpacket.0:
        .long   0x3f800000
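
For readers who find the icc listing hard to follow, its strategy can be sketched in C with SSE intrinsics. This is only an illustration, not icc's actual output or source: the names `f_scalar` and `f_vectorized` are made up here, and the sketch assumes an x86-64 target where `<xmmintrin.h>` is available. It keeps eight independent 4-wide accumulators, consumes 32 floats per iteration, then reduces the accumulators to a single scalar and adds the initial 1.0 at the end.

```c
#include <xmmintrin.h>  /* SSE intrinsics, part of the x86-64 baseline */

/* Reference: the loop from the question, summed strictly in order. */
float f_scalar(const float x[128]) {
    float p = 1.0f;
    for (int i = 0; i < 128; i++)
        p += x[i];
    return p;
}

/* Hypothetical hand-written analogue of the icc listing. */
float f_vectorized(const float x[128]) {
    __m128 acc[8];
    for (int k = 0; k < 8; k++)
        acc[k] = _mm_setzero_ps();            /* pxor + movaps in the listing */

    /* 8 unaligned loads (movups) and 8 independent adds (addps) per pass. */
    for (int i = 0; i < 128; i += 32)
        for (int k = 0; k < 8; k++)
            acc[k] = _mm_add_ps(acc[k], _mm_loadu_ps(x + i + 4 * k));

    /* Combine the eight accumulators into one vector of four partial sums. */
    __m128 s01 = _mm_add_ps(acc[0], acc[1]);
    __m128 s23 = _mm_add_ps(acc[2], acc[3]);
    __m128 s45 = _mm_add_ps(acc[4], acc[5]);
    __m128 s67 = _mm_add_ps(acc[6], acc[7]);
    __m128 s = _mm_add_ps(_mm_add_ps(s01, s23), _mm_add_ps(s45, s67));

    /* Horizontal sum of the four lanes. */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         /* movhlps: lanes 2,3 onto 0,1 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0xF5));  /* shufps 245: lane 1 onto lane 0 */

    /* The initial value 1.0 is added last, as in the listing. */
    return _mm_cvtss_f32(s) + 1.0f;
}
```

Because the vectorized version adds the elements in a different order than the source loop, its result can differ in the last bits for general inputs. That reordering is one likely reason gcc keeps the scalar vaddss chain here: gcc only reassociates floating-point sums when it is allowed to, for example with -ffast-math.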

If we ignore the loop unrolling, the most obvious difference is that gcc emits the v-prefixed vaddss while icc emits the unprefixed addss and addps.

Is there a performance difference between these two pieces of assembly and which one is better (ignoring the loop unrolling)?


The v prefix comes from the VEX coding scheme. It seems you can get icc to use these instructions by adding -xavx to the command-line flags. However, the question remains whether there is any performance difference between the two sets of assembly in the question, or whether either has an advantage over the other.

The instructions whose mnemonics are prefixed with v are VEX-encoded instructions. The VEX encoding scheme can encode every SSE instruction as well as the new AVX instructions and some other instructions. There is an almost 1:1 correspondence between legacy instructions and VEX-encoded instructions, with the following differences:

  • VEX-encoded SSE instructions implicitly zero the upper 128 bits of the ymm register that contains the xmm destination register. This avoids a costly partial register update if a previous instruction left data in these bits.
  • The VEX encoding scheme allows instructions to have a separate destination operand instead of overwriting one of the input operands. This reduces register pressure and lets the compiler generate fewer data moves, slightly increasing performance.
  • AVX instructions can only be encoded with a VEX prefix because the 256-bit data width cannot be communicated by any other means.
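
The point about the separate destination operand can be made concrete with a small C function in which both inputs are still needed after the addition. The function name `diff_of_squares` is made up for this sketch, and the instruction sequences in the comment are hand-written illustrations of what a compiler could plausibly emit, not captured compiler output; actual register allocation will vary.

```c
/* Hypothetical example: a and b are both used again after a + b,
 * so the addition must not destroy either of them. */
float diff_of_squares(float a, float b) {
    /* Legacy SSE encoding, one plausible sequence (a in xmm0, b in xmm1):
     *     movaps xmm2, xmm0        ; extra copy, since addss overwrites its first operand
     *     addss  xmm2, xmm1        ; xmm2 = a + b
     *     subss  xmm0, xmm1        ; xmm0 = a - b
     *     mulss  xmm0, xmm2        ; xmm0 = (a - b) * (a + b)
     *
     * VEX encoding, one plausible sequence:
     *     vaddss xmm2, xmm0, xmm1  ; xmm2 = a + b, inputs left untouched
     *     vsubss xmm0, xmm0, xmm1  ; xmm0 = a - b
     *     vmulss xmm0, xmm2, xmm0  ; xmm0 = (a + b) * (a - b)
     */
    return (a + b) * (a - b);
}
```

In the legacy form the register copy exists only to preserve an input, which is the kind of data move the three-operand VEX form makes unnecessary.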
