
Why is one of these sooooo much faster than the other?

I'm writing C++ code to find the first byte in memory that is not 0xFF. To exploit bitscanforward, I had written inline assembly code that I like very much. But for "readability", as well as future proofing (i.e. SIMD vectorization), I thought I would give the g++ optimizer a chance. g++ didn't vectorize, but it did arrive at nearly the same non-SIMD solution I did. But for some reason, its version runs much slower, 260000x slower (i.e. I have to loop my version 260,000x more times to reach the same execution time). I expected some difference, but not THAT much! Can someone point out why that might be? I just want to know, so that I can avoid the same mistake in future inline assembly code.

The C++ starting point is the following (in terms of counting accuracy, there is a bug in this code, but I've simplified it for this speed test):

uint64_t count3 (const void *data, uint64_t const &nBytes) {
      uint64_t count = 0;
      uint64_t block;
      do {
         block = *(uint64_t*)(data+count);
         if ( block != (uint64_t)-1 ) {
/*       count += __builtin_ctz(~block);   ignore this for speed test*/
            goto done;
          };
        count += sizeof(block);
      } while ( count < nBytes );
done:
      return (count>nBytes ? nBytes : count);
}

The assembly code g++ came up with is:

_Z6count3PKvRKm:
.LFB33:
    .cfi_startproc
    mov rdx, QWORD PTR [rsi]
    xor eax, eax
    jmp .L19
    .p2align 4,,10
    .p2align 3
.L21:
    add rax, 8
    cmp rax, rdx
    jnb .L18
.L19:
    cmp QWORD PTR [rdi+rax], -1
    je  .L21
.L18:
    cmp rax, rdx
    cmova   rax, rdx
    ret
    .cfi_endproc

My inline assembly is:

_Z6count2PKvRKm:
.LFB32:
    .cfi_startproc
    push    rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    mov rbx, QWORD PTR [rsi]

    # count trailing bytes of 0xFF 
    xor     rax, rax  
.ctxff_loop_69:          
    mov     r9,  QWORD PTR [rdi+rax] 
    xor     r9, -1          
    jnz   .ctxff_final_69    
    add     rax, 8     
    cmp     rax, rbx 
    jl    .ctxff_loop_69    
.ctxff_final_69:         
    cmp     rax,rbx  
    cmova   rax,rbx  
    pop rbx
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

As far as I can see, it is substantially identical, except for the method by which it compares the data bytes against 0xFF. But I cannot believe this would cause such a great difference in computation time.

It's conceivable that my test method is causing the error, but all I do is change the function name and iteration length in the simple for-loops shown below (where N is 1<<20, and all bytes of 'a' except the last byte are 0xFF):

test 1

   for (uint64_t i=0; i < ((uint64_t)1<<15); i++) {
      n = count3(a,N);
   }

test 2

   for (uint64_t i=0; i < ((uint64_t)1<<33); i++) {
      n = count2(a,N);
   }

EDIT:

Here is my real inline assembly code: the SSE count1(), the x86-64 count2(), and then the plain-old-C++ versions count0() and count3(). I fell down this rabbit hole hoping that I could get g++ to take my count0() and arrive, on its own, at my count1() or even count2(). But alas it did nothing, absolutely no optimization :( I should add that my platform doesn't have AVX2, which is why I was hoping to get g++ to automatically vectorize, so that the code would automatically update when I update my platform.

In terms of the explicit register use in the inline assembly: if I didn't specify them explicitly, g++ would reuse the same register for nBytes and count.

In terms of the speedup between XMM and QWORD, I found the real benefit is simply the "loop-unroll" effect, which I replicate in count2().

uint32_t count0(const uint8_t *data, uint64_t const &nBytes) {

  for (int i=0; i<nBytes; i++)
    if (data[i] != 0xFF) return i;

  return nBytes;
}
uint32_t count1(const void *data, uint64_t const &nBytes) {
  uint64_t count;
  __asm__("# count trailing bytes of 0xFF \n"
    "   xor     %[count], %[count]  \n"
    " vpcmpeqb  xmm0, xmm0, xmm0  \n" // make array of 0xFF

    ".ctxff_next_block_%=:        \n"
    " vpcmpeqb  xmm1, xmm0, XMMWORD PTR [%[data]+%[count]]  \n"
    " vpmovmskb r9, xmm1         \n"
    " xor     r9, 0xFFFF       \n" // test if all match (bonus negate r9)
    " jnz   .ctxff_tzc_%=        \n" // if !=0, STOP & tzcnt negated r9
    " add     %[count], 16       \n" // else inc
    " cmp     %[count], %[nBytes] \n"
    " jl    .ctxff_next_block_%=  \n" // while count < nBytes, loop
    " jmp   .ctxff_done_%=      \n" // else done + ALL bytes were 0xFF

    ".ctxff_tzc_%=:           \n"
    " tzcnt   r9, r9          \n" // count bytes up to non-0xFF
    " add     %[count], r9    \n"

    ".ctxff_done_%=:          \n" // more than 'nBytes' could be tested,
    " cmp     %[count],%[nBytes]  \n" // find minimum
    " cmova   %[count],%[nBytes]  "
    : [count] "=a" (count)
    : [nBytes] "b" (nBytes), [data] "d" (data)
    : "r9", "xmm0", "xmm1"
  );
  return count;
};

uint64_t count2 (const void *data, uint64_t const &nBytes) {
    uint64_t count;
  __asm__("# count trailing bytes of 0xFF \n"
    "    xor     %[count], %[count]  \n"

    ".ctxff_loop_%=:          \n"
    "    mov     r9,  QWORD PTR [%[data]+%[count]] \n"
    "    xor     r9, -1          \n" 
    "    jnz   .ctxff_final_%=    \n"
    "    add     %[count], 8     \n" 
    "    mov     r9,  QWORD PTR [%[data]+%[count]] \n"  // <--loop-unroll
    "    xor     r9, -1          \n" 
    "    jnz   .ctxff_final_%=    \n"
    "    add     %[count], 8     \n" 
    "    cmp     %[count], %[nBytes] \n"
    "    jl    .ctxff_loop_%=    \n"
    "    jmp   .ctxff_done_%=   \n" 

    ".ctxff_final_%=:            \n"
    "    bsf   r9,  r9           \n" // do tz count on r9 (either of first QWORD bits or XMM bytes)
    "    shr     r9,  3          \n" // scale BSF count accordiningly
    "    add     %[count], r9    \n"
    ".ctxff_done_%=:          \n" // more than 'nBytes' bytes could have been tested,
    "    cmp     %[count],%[nBytes]  \n" // find minimum of count and nBytes
    "    cmova   %[count],%[nBytes]  "
    : [count] "=a" (count)
    : [nBytes] "b" (nBytes), [data] "D" (data)
    : "r9"
  );
  return count;
}

inline static uint32_t tzcount(uint64_t const &qword) {
  uint64_t tzc;
  asm("tzcnt %0, %1" : "=r" (tzc) : "r" (qword) );
  return tzc;
};

uint64_t count3 (const void *data, uint64_t const &nBytes) {
      uint64_t count = 0;
      uint64_t block;
      do {
        block = *(uint64_t*)(data+count);
         if ( block != (uint64_t)-1 ) {
           count += tzcount(~block);
            goto done;
          };
        count += sizeof(block);
      } while ( count < nBytes );
done:
      return (count>nBytes ? nBytes : count);
}

uint32_t N = 1<<20;

int main(int argc, char **argv) {

  unsigned char a[N];
  __builtin_memset(a,0xFF,N);

  uint64_t n = 0, j;
   for (uint64_t i=0; i < ((uint64_t)1<<18); i++) {
      n += count2(a,N);
   }

 printf("\n\n %x %x %x\n",N, n, 0);   
  return n;
}

Answer to the question title

Now that you've posted the full code: the call to count2(a,N) is hoisted out of the loop in main. The run time still increases very slightly with the loop count (e.g. 1<<18), but all that loop is doing is a single add. The compiler optimizes it to look more like this source:

uint64_t hoisted_count = count2(a,N);
for (uint64_t i=0; i < ((uint64_t)1<<18); i++) {
   n += hoisted_count;   // doesn't optimize to a multiply
}

There is no register conflict: %rax holds the result of the asm statement inlined from count2. It's then used as a source operand in the tiny loop that multiplies it by the repeat count through repeated addition, accumulating into n.

(See the asm on the Godbolt Compiler Explorer, and note all the compiler warnings about arithmetic on void*s: clang refuses to compile your code):

## the for() loop in main, when using count2()
.L23:
    addq    %rax, %r12
    subq    $1, %rdx
    jne     .L23

%rdx is the loop counter here, and %r12 is the accumulator that holds n. IDK why gcc doesn't optimize it to a constant-time multiply.

Presumably the version that was 260k times slower didn't manage to hoist the whole count2 out of the loop. From gcc's perspective, the inline asm version is much simpler: the asm statement is treated as a pure function of its inputs, and gcc doesn't even know anything about it touching memory. The C version touches a bunch of memory, and it's much more complicated to prove that it can be hoisted.

Using a "memory" clobber in the asm statement did prevent it from being hoisted when I checked on godbolt. You can tell from the presence or absence of a branch target in main before the vector block.

But anyway, the run time will be something like n + rep_count vs. n * rep_count.
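
As an aside (not from the code above): a common way to keep such a call inside a timing loop is to launder the pointer through an empty asm statement each iteration, so the compiler can no longer prove the inputs are loop-invariant. A minimal sketch, where the names and repeat count are just placeholders:

   const unsigned char *p = a;
   uint64_t n = 0;
   for (uint64_t i=0; i < ((uint64_t)1<<18); i++) {
      asm volatile("" : "+r"(p));   // compiler must assume p may have changed
      n += count2(p, N);
   }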

The asm statement doesn't use a "memory" clobber or any memory inputs to tell gcc that it reads the memory pointed to by the input pointers. Incorrect optimizations could happen, e.g. being hoisted out of a loop that modified array elements. (See the Clobbers section in the manual for an example of using a dummy anonymous struct memory input instead of a blanket "memory" clobber. Unfortunately I don't think that's usable when the block of memory doesn't have compile-time-constant size.)
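
As a concrete sketch of that point (not code from the answer), the minimal change would be to add "memory" to the clobber list of count2's asm statement, leaving the template itself unchanged:

  __asm__("# count trailing bytes of 0xFF \n"
    /* ... same asm template as in count2() above ... */
    : [count] "=a" (count)
    : [nBytes] "b" (nBytes), [data] "D" (data)
    : "r9", "memory"   // tell gcc this asm may read memory, so it can't be hoisted past writes
  );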

I think -fno-inline prevents hoisting because the function isn't marked with __attribute__((const)) or the slightly weaker __attribute__((pure)) to indicate no side-effects. After inlining, the optimizer can see that for the asm statement.
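
For reference, a sketch of that attribute syntax (not a recommendation for this benchmark, and only valid if the function truly has no side effects):

__attribute__((pure))   // reads memory via its pointer arg, but has no side effects
uint64_t count2(const void *data, uint64_t const &nBytes);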


count0 doesn't get optimized to anything good because gcc and clang can't auto-vectorize loops where the number of iterations isn't known at the start. i.e. they suck at stuff like strlen or memchr, or search loops in general, even if they're told that it's safe to access memory beyond the point where the search loop exits early (e.g. using char buf[static 512] as a function arg).


Optimizations for your asm code:

Like I commented on the question, using xor reg, 0xFFFF / jnz is silly compared to cmp reg, 0xFFFF / jnz, because cmp/jcc can macro-fuse into a compare-and-branch uop. cmp reg, mem / jne can also macro-fuse, so the scalar version that does a load/xor/branch is using 3x as many uops per compare. (Of course, Sandybridge can only micro-fuse the load if it doesn't use an indexed addressing mode. Also, SnB can only macro-fuse one pair per decode block, but you'd probably get the first cmp/jcc and the loop branch to macro-fuse.) Anyway, the xor is a bad idea. It's better to only xor right before the tzcnt, since saving uops in the loop is more important than code-size or uops total.

Your scalar loop is 9 fused-domain uops, which is one too many to issue at one iteration per 2 clocks. (SnB is a 4-wide pipeline, and for tiny loops it can actually sustain that.)
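
A rough, untested sketch of what that suggestion looks like for the scalar loop, with register assignments matching count2 above: keep the all-ones pattern in a register so the compare can read memory directly and pair with its branch, and complement only once, right before the bit scan.

    mov     rcx, -1                      # all-ones pattern kept in a register
.Lscan:
    cmp     rcx, QWORD PTR [rdi+rax]     # compare reads memory directly
    jne     .Lfound
    add     rax, 8
    cmp     rax, rbx
    jb      .Lscan
    jmp     .Ldone                       # every block was 0xFF
.Lfound:
    mov     r9, QWORD PTR [rdi+rax]      # reload the mismatching block
    not     r9                           # invert only here, just before tzcnt
    tzcnt   r9, r9
    shr     r9, 3                        # bit index -> byte index
    add     rax, r9
.Ldone: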


The indentation in the code in the first version of the question, with the count += __builtin_ctz at the same level as the if, made me think you were counting mismatch blocks, rather than just finding the first.

Unfortunately the asm code I wrote for the first version of this answer doesn't solve the same problem as the OP's updated and clearer code. See an old version of this answer for SSE2 asm that counts 0xFF bytes using pcmpeqb/paddb, and psadbw for the horizontal sum to avoid wraparound.


Getting a speedup with SSE2 (or AVX):

Branching on the result of a pcmpeq takes many more uops than branching on a cmp. If our search array is big, we can use a loop that tests multiple vectors at once, and then figure out which byte had our hit after breaking out of the loop.

This optimization applies to AVX2 as well.
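
Purely as an illustration (not part of the code below): with AVX2 the compare-and-test in the loop body would widen to 256-bit ymm registers, something like the following, where ymm0 holds the all-ones pattern.

    vpcmpeqb  ymm1, ymm0, YMMWORD PTR [rdi]   # 32 bytes per compare
    vpmovmskb eax, ymm1                       # one mask bit per byte
    cmp       eax, -1                         # all 32 bits set => every byte was 0xFF
    jne       .Lfound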

Here's my attempt, using GNU C inline asm with -masm=intel syntax. (Intrinsics might give better results, esp. when inlining, because the compiler understands intrinsics and so can do constant-propagation through them, and stuff like that. OTOH, you can often beat the compiler with hand-written asm if you understand the trade-offs and the microarchitecture you're targeting, or if you can safely make some assumptions that you can't easily communicate to the compiler.)

#include <stdint.h>
#include <immintrin.h>

// compile with -masm=intel
// len must be a multiple of 32  (TODO: cleanup loop)
// buf should be 16B-aligned for best performance
size_t find_first_zero_bit_avx1(const char *bitmap, size_t len) {
    // return size_t not uint64_t.  This same code works in 32bit mode, and in the x32 ABI where pointers are 32bit

    __m128i pattern, vtmp1, vtmp2;
    const char *result_pos;
    int tmpi;

    const char *bitmap_start = bitmap;

    asm (  // modifies the bitmap pointer, but we're inside a wrapper function
      "vpcmpeqw   %[pat], %[pat],%[pat]\n\t"          // all-ones

      ".p2align 4\n\t"   // force 16B loop alignment, for the benefit of CPUs without a loop buffer
      //IACA_START  // See the godbolt link for the macro definition
      ".Lcount_loop%=:\n\t"
//      "  movdqu    %[v1], [ %[p] ]\n\t"
//      "  pcmpeqb   %[v1], %[pat]\n\t"        // for AVX: fold the load into vpcmpeqb, making sure to still use a one-register addressing mode so it can micro-fuse
//      "  movdqu    %[v2], [ %[p] + 16 ]\n\t"
//      "  pcmpeqb   %[v2], %[pat]\n\t"

      "  vpcmpeqb  %[v1], %[pat], [ %[p] ]\n\t"  // Actually use AVX, to get a big speedup over the OP's scalar code on his SnB CPU
      "  vpcmpeqb  %[v2], %[pat], [ %[p] + 16 ]\n\t"

      "  vpand     %[v2], %[v2], %[v1]\n\t"         // combine the two results from this iteration
      "  vpmovmskb  %k[result], %[v2]\n\t"
      "  cmp       %k[result], 0xFFFF\n\t"          // k modifier: eax instead of rax
      "  jne     .Lfound%=\n\t"

      "  add       %[p], 32\n\t"
      "  cmp       %[p], %[endp]\n\t"              // this is only 2 uops after the previous cmp/jcc.  We could re-arrange the loop and put the branches farther apart if needed.  (e.g. start with a vpcmpeqb outside the loop, so each iteration actually sets up for the next)
      "  jb     .Lcount_loop%=\n\t"
      //IACA_END

      // any necessary code for the not-found case, e.g. bitmap = endp
      "  mov     %[result], %[endp]\n\t"
      "  jmp    .Lend%=\n\t"

      ".Lfound%=:\n\t"                       // we have to figure out which vector the first non-match was in, based on v1 and (v2&v1)
                                  // We could just search the bytes over again, but we don't have to.
                                  // we could also check v1 first and branch, instead of checking both and using a branchless check.
      "  xor       %k[result], 0xFFFF\n\t"
      "  tzcnt     %k[result], %k[result]\n\t"  // runs as bsf on older CPUs: same result for non-zero inputs, but different flags.  Faster than bsf on AMD
      "  add       %k[result], 16\n\t"          // result = byte count in case v1 is all-ones.  In that case, v2&v1 = v2

      "  vpmovmskb %k[tmp], %[v1]\n\t"
      "  xor       %k[tmp], 0xFFFF\n\t"
      "  bsf       %k[tmp], %k[tmp]\n\t"        // bsf sets ZF if its *input* was zero.  tzcnt's flag results are based on its output.  For AMD, it would be faster to use more insns (or a branchy strategy) and avoid bsf, but Intel has fast bsf.
      "  cmovnz    %k[result], %k[tmp]\n\t"     // if there was a non-match in v1, use it instead of tzcnt(v2)+16

      "  add       %[result], %[p]\n\t"         // If we needed to force 64bit, we could use %q[p].  But size_t should be 32bit in the x32 ABI, where pointers are 32bit.  This is one advantage to using size_t over uint64_t
      ".Lend%=:\n\t"
      : [result] "=&a" (result_pos),   // force compiler to pick eax/rax to save a couple bytes of code-size from the special cmp eax, imm32 and xor eax,imm32 encodings
        [p] "+&r" (bitmap),
        // throw-away outputs to let the compiler allocate registers.  All early-clobbered so they aren't put in the same reg as an input
        [tmp] "=&r" (tmpi),
        [pat] "=&x" (pattern),
        [v1] "=&x" (vtmp1), [v2] "=&x" (vtmp2)
      : [endp] "r" (bitmap+len)
        // doesn't compile: len isn't a compile-time constant
        // , "m" ( ({ struct { char x[len]; } *dummy = (typeof(dummy))bitmap ; *dummy; }) )  // tell the compiler *which* memory is an input.
      : "memory" // we read from data pointed to by bitmap, but bitmap[0..len] isn't an input, only the pointer.
    );

    return result_pos - bitmap_start;
}

This actually compiles and assembles to asm that looks like what I expected, but I didn't test it. Note that it leaves all register allocation to the compiler, so it's more inlining-friendly. Even without inlining, it doesn't force use of a call-preserved register that has to get saved/restored (e.g. your use of a "b" constraint).

Not done: scalar code to handle the last sub-32B chunk of data.
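
Since intrinsics were mentioned above as a friendlier alternative, here is a rough SSE2-intrinsics sketch of the same search (my own illustration, untested; one 16B vector per iteration, len assumed to be a multiple of 16, and the tail cleanup likewise left undone; the function name is made up):

#include <immintrin.h>
#include <stddef.h>

size_t find_first_not_ff_sse2(const unsigned char *buf, size_t len) {
    const __m128i ones = _mm_set1_epi8((char)0xFF);
    for (size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        // one mask bit per byte: bit set where buf[i+k] == 0xFF
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, ones));
        if (mask != 0xFFFF)                     // some byte differed from 0xFF
            return i + (size_t)__builtin_ctz(~mask & 0xFFFF);
    }
    return len;
}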

Static perf analysis for Intel SnB-family CPUs, based on Agner Fog's guides / tables. See also the tag wiki. I'm assuming we're not bottlenecked on cache throughput, so this analysis only applies when the data is hot in L2 cache, or maybe only L1 cache is fast enough.

This loop can issue out of the front-end at one iteration (two vectors) per 2 clocks, because it's 7 fused-domain uops. (The front-end issues in groups of 4.) (It's probably actually 8 uops, if the two cmp/jcc pairs are decoded in the same block. Haswell and later can do two macro-fusions per decode group, but previous CPUs can only macro-fuse the first. We could software-pipeline the loop so the early-out branch is farther from the p < endp branch.)

All of these fused-domain uops include an ALU uop, so the bottleneck will be on ALU execution ports. Haswell added a 4th ALU unit that can handle simple non-vector ops, including branches, so it could run this loop at one iteration per 2 clocks (16B per clock). Your i5-2550k (mentioned in comments) is a SnB CPU.

I used IACA to count uops per port, since it's time-consuming to do it by hand. IACA is dumb and thinks there's some kind of inter-iteration dependency other than the loop counter, so I had to use -no_interiteration:

g++ -masm=intel -Wall -Wextra -O3 -mtune=haswell find-first-zero-bit.cpp -c -DIACA_MARKS
iaca -64 -arch IVB -no_interiteration find-first-zero-bit.o

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - find-first-zero-bit.o
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 2.50 Cycles       Throughput Bottleneck: Port1, Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 2.0    0.0  | 2.5  | 1.0    1.0  | 1.0    1.0  | 0.0  | 2.5  |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   2^   |           | 1.0 | 1.0   1.0 |           |     |     | CP | vpcmpeqb xmm1, xmm0, xmmword ptr [rdx]
|   2^   |           | 0.6 |           | 1.0   1.0 |     | 0.4 | CP | vpcmpeqb xmm2, xmm0, xmmword ptr [rdx+0x10]
|   1    | 0.9       | 0.1 |           |           |     | 0.1 | CP | vpand xmm2, xmm2, xmm1
|   1    | 1.0       |     |           |           |     |     |    | vpmovmskb eax, xmm2
|   1    |           |     |           |           |     | 1.0 | CP | cmp eax, 0xffff
|   0F   |           |     |           |           |     |     |    | jnz 0x18
|   1    | 0.1       | 0.9 |           |           |     |     | CP | add rdx, 0x20
|   1    |           |     |           |           |     | 1.0 | CP | cmp rdx, rsi
|   0F   |           |     |           |           |     |     |    | jb 0xffffffffffffffe1

On SnB: pcmpeqb can run on p1/p5. Fused compare-and-branch can only run on p5. Non-fused cmp can run on p015. Anyway, if one of the branches doesn't macro-fuse, the loop can run at one iteration per 8/3 = 2.666 cycles. With macro-fusion, best-case is 7/3 = 2.333 cycles. (IACA doesn't try to simulate distribution of uops to ports exactly the way the hardware would dynamically make those decisions. However, we can't expect perfect scheduling from the hardware either, so 2 vectors per 2.5 cycles is probably reasonable with both macro-fusions happening. Uops that could have used port0 will sometimes steal port1 or port5, reducing throughput.)

As I said before, Haswell handles this loop better. IACA thinks HSW could run the loop at one iteration per 1.75c, but that's clearly wrong because the taken loop-branch ends the issue group. It will issue in a repeating 4,3 uop pattern. But the execution units can handle more throughput than the frontend for this loop, so it should really be able to keep up with the frontend on Haswell/Broadwell/Skylake and run at one iteration per 2 clocks.

Further unrolling of more vpcmpeqb / vpand is only 2 uops per vector (or 3 without AVX, where we'd load into a scratch register and then use that as the destination for pcmpeqb). So with sufficient unrolling, we should be able to do 2 vector loads per clock. Without AVX, this wouldn't be possible without the PAND trick, since a vector load/compare/movmsk/test-and-branch is 4 uops. Bigger unrolls mean more work to decode the final position where we found a match: a scalar cmp-based cleanup loop might be a good idea once we're in the area. You could maybe use the same scalar loop for cleanup of non-multiple-of-32B sizes.
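
A sketch of what a deeper unroll could look like (my own illustration, untested): four vectors per iteration, ANDed together so there is still only one test-and-branch per 64 bytes. Decoding the exact byte after .Lfound4 is omitted; as suggested above, a scalar cleanup loop could re-scan that 64B region.

.Lunroll4:
    vpcmpeqb  xmm1, xmm0, XMMWORD PTR [rdi]
    vpcmpeqb  xmm2, xmm0, XMMWORD PTR [rdi+16]
    vpcmpeqb  xmm3, xmm0, XMMWORD PTR [rdi+32]
    vpcmpeqb  xmm4, xmm0, XMMWORD PTR [rdi+48]
    vpand     xmm1, xmm1, xmm2
    vpand     xmm3, xmm3, xmm4
    vpand     xmm1, xmm1, xmm3                 # all-ones only if all 64 bytes matched
    vpmovmskb eax, xmm1
    cmp       eax, 0xFFFF
    jne       .Lfound4
    add       rdi, 64
    cmp       rdi, rsi                         # rsi = end pointer
    jb        .Lunroll4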

If using SSE, with movdqu / pcmpeqb xmm,xmm, we can use an indexed addressing mode without it costing us uops, because a movdqu load is always a single load uop regardless of addressing mode. (It doesn't need to micro-fuse with anything, unlike a store.) This lets us save a uop of loop overhead by using a base pointer pointing to the end of the array, and the index counting up toward zero from a negative value, e.g. add %[idx], 32 / js to loop while the index is negative.
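
For illustration only (not code from above): a single-vector version of that end-pointer-plus-negative-index idiom might look like this, with rsi = buf + len and rcx starting at -len.

.Lscan_neg:
    movdqu    xmm1, XMMWORD PTR [rsi+rcx]      # indexed movdqu load is still one uop
    pcmpeqb   xmm1, xmm0
    pmovmskb  eax, xmm1
    cmp       eax, 0xFFFF
    jne       .Lfound_neg
    add       rcx, 16
    js        .Lscan_neg                       # keep looping while the index is negative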

With AVX, however, we can save 2 uops by using a single-register addressing mode so vpcmpeqb %[v1], %[pat], [ %[p] + 16 ] can micro-fuse. This means we need the add/cmp/jcc loop structure I used in the example. The same applies to AVX2.

So I think I found the problem. I think one of the registers used in my inline assembly, despite the clobber list, was conflicting with g++'s use of them, and was corrupting the test iteration. I fed the g++ version of the code back in as inline assembly and got the same 260000x acceleration as my own. Also, in retrospect, the "accelerated" computation time was absurdly short.

Finally, I was so focused on the code embodied as a function that I failed to notice that g++ had, in fact, inlined the function (I was using -O3 optimization) into the test for-loop as well. When I forced g++ not to inline (i.e. -fno-inline), the 260000x acceleration disappeared.

I think g++ failed to take into account the inline assembly code's "clobber list" when it inlined the entire function without my permission.

Lesson learned. I need to do better with inline assembly constraints, or block inlining of the function with __attribute__ ((noinline)).
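
For reference, a sketch of the syntax for the latter option; the attribute goes on the function definition:

__attribute__ ((noinline))
uint64_t count2 (const void *data, uint64_t const &nBytes) { /* ... */ }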

EDIT: Definitely found that g++ is using rax for the main() for-loop counter, in conflict with my use of rax.
