用 64 位替换 32 位循环计数器会在 Intel CPU 上使用 _mm_popcnt_u64 引入疯狂的性能偏差

Question

I was looking for the fastest way to popcount large arrays of data.我一直在寻找对大量数据进行popcount的最快方法。 I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.我遇到了一个非常奇怪的效果：将循环变量从unsigned更改为uint64_t使我的 PC 上的性能下降了 50%。

The Benchmark基准

#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    free(charbuffer);
}

As you see, we create a buffer of random data, with the size being x megabytes where x is read from the command line.如您所见，我们创建了一个随机数据缓冲区，大小为x兆字节，其中x是从命令行读取的。 Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount.之后，我们遍历缓冲区并使用 x86 popcount内在函数的展开版本来执行 popcount。 To get a more precise result, we do the popcount 10,000 times.为了获得更精确的结果，我们进行了 10,000 次 popcount。 We measure the times for the popcount.我们测量popcount的时间。 In the upper case, the inner loop variable is unsigned , in the lower case, the inner loop variable is uint64_t .在大写中，内循环变量是unsigned ，在小写中，内循环变量是uint64_t 。 I thought that this should make no difference, but the opposite is the case.我认为这应该没有区别，但情况恰恰相反。

The (absolutely crazy) results （绝对疯狂的）结果

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):我是这样编译的（g++ 版本：Ubuntu 4.8.2-19ubuntu1）：

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data):以下是在我的Haswell Core i7-4770K CPU @ 3.50 GHz 上运行test 1 （因此 1 MB 随机数据）的结果：

unsigned 41959360000 0.401554 sec 26.113 GB/s无符号 41959360000 0.401554 秒26.113 GB/s
uint64_t 41959360000 0.759822 sec 13.8003 GB/s uint64_t 41959360000 0.759822 秒13.8003 GB/秒

As you see, the throughput of the uint64_t version is only half the one of the unsigned version!如您所见， uint64_t版本的吞吐量仅为unsigned版本的一半！ The problem seems to be that different assembly gets generated, but why?问题似乎是生成了不同的程序集，但为什么呢？ First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):首先，我想到了一个编译器的bug，所以我尝试了clang++ （Ubuntu Clang version 3.4-1ubuntu3）：

clang++ -O3 -march=native -std=c++11 teest.cpp -o test

Result: test 1结果： test 1

unsigned 41959360000 0.398293 sec 26.3267 GB/s无符号 41959360000 0.398293 秒26.3267 GB/s
uint64_t 41959360000 0.680954 sec 15.3986 GB/s uint64_t 41959360000 0.680954 秒15.3986 GB/秒

So, it is almost the same result and is still strange.所以，这几乎是相同的结果，仍然很奇怪。 But now it gets super strange.但现在它变得非常奇怪。 I replace the buffer size that was read from input with a constant 1 , so I change:我用常量1替换了从输入读取的缓冲区大小，所以我改变了：

uint64_t size = atol(argv[1]) << 20;

to到

uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time.因此，编译器现在知道编译时的缓冲区大小。 Maybe it can add some optimizations!也许它可以添加一些优化！ Here are the numbers for g++ :以下是g++的数字：

unsigned 41959360000 0.509156 sec 20.5944 GB/s无符号 41959360000 0.509156 秒20.5944 GB/s
uint64_t 41959360000 0.508673 sec 20.6139 GB/s uint64_t 41959360000 0.508673 秒20.6139 GB/秒

Now, both versions are equally fast.现在，两个版本都同样快。 However, the unsigned got even slower !然而， unsigned变得更慢！ It dropped from 26 to 20 GB/s , thus replacing a non-constant by a constant value lead to a deoptimization .它从26 20 GB/s下降到20 GB/s ，因此用常数值替换非常量会导致去优化。 Seriously, I have no clue what is going on here!说真的，我不知道这里发生了什么！ But now to clang++ with the new version:但是现在要使用新版本的clang++ ：

unsigned 41959360000 0.677009 sec 15.4884 GB/s未签名 41959360000 0.677009 秒15.4884 GB/秒
uint64_t 41959360000 0.676909 sec 15.4906 GB/s uint64_t 41959360000 0.676909 秒15.4906 GB/秒

Wait, what?等等，什么？ Now, both versions dropped to the slow number of 15 GB/s.现在，两个版本都下降到 15 GB/s 的慢速数字。 Thus, replacing a non-constant by a constant value even lead to slow code in both cases for Clang!因此，在 Clang 的两种情况下，用常量值替换非常量甚至会导致代码变慢！

I asked a colleague with anIvy Bridge CPU to compile my benchmark.我请一位使用Ivy Bridge CPU 的同事来编译我的基准测试。 He got similar results, so it does not seem to be Haswell.他得到了类似的结果，所以似乎不是哈斯韦尔。 Because two compilers produce strange results here, it also does not seem to be a compiler bug.因为两个编译器在这里产生奇怪的结果，它似乎也不是编译器错误。 We do not have an AMD CPU here, so we could only test with Intel.我们这里没有 AMD CPU，所以我们只能用 Intel 进行测试。

More madness, please!更疯狂，拜托！

Take the first example (the one with atol(argv[1]) ) and put a static before the variable, ie:以第一个示例（带有atol(argv[1])示例）并在变量之前放置一个static变量，即：

static uint64_t size=atol(argv[1])<<20;

Here are my results in g++:这是我在 g++ 中的结果：

unsigned 41959360000 0.396728 sec 26.4306 GB/s无符号 41959360000 0.396728 秒26.4306 GB/s
uint64_t 41959360000 0.509484 sec 20.5811 GB/s uint64_t 41959360000 0.509484 秒20.5811 GB/秒

Yay, yet another alternative .是的，还有另一种选择。 We still have the fast 26 GB/s with u32 , but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version!我们仍然有快26 GB / s的u32 ，但我们设法u64从13 GB至少/ S到20 GB / s的版本！ On my collegue's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all.在我collegue的PC中， u64版本成为速度甚至超过了u32的版本，产生所有的最快的结果。 Sadly, this only works for g++ , clang++ does not seem to care about static .可悲的是，这只适用于g++ ， clang++似乎并不关心static 。

My question我的问题

Can you explain these results?你能解释一下这些结果吗？ Especially:尤其：

How can there be such a difference between u32 and u64 ?哪有之间的这种差异u32和u64 ？
How can replacing a non-constant by a constant buffer size trigger less optimal code ?如何用恒定缓冲区大小替换非常量触发不太理想的代码？
How can the insertion of the static keyword make the u64 loop faster? static关键字的插入如何让u64循环更快？ Even faster than the original code on my collegue's computer!甚至比我同事电脑上的原始代码还要快！

I know that optimization is a tricky territory, however, I never thought that such small changes can lead to a 100% difference in execution time and that small factors like a constant buffer size can again mix results totally.我知道优化是一个棘手的领域，但是，我从未想过如此小的更改会导致执行时间的100% 差异，并且诸如恒定缓冲区大小之类的小因素可以再次完全混合结果。 Of course, I always want to have the version that is able to popcount 26 GB/s.当然，我一直希望有能够达到 26 GB/s 的版本。 The only reliable way I can think of is copy paste the assembly for this case and use inline assembly.我能想到的唯一可靠的方法是复制粘贴这种情况下的程序集并使用内联程序集。 This is the only way I can get rid of compilers that seem to go mad on small changes.这是我摆脱那些似乎因小改动而发疯的编译器的唯一方法。 What do you think?你怎么认为？ Is there another way to reliably get the code with most performance?有没有另一种方法可以可靠地获得最高性能的代码？

The Disassembly拆卸

Here is the disassembly for the various results:以下是各种结果的反汇编：

26 GB/s version from g++ / u32 / non-const bufsize :来自g++ / u32 / non-const bufsize 的26 GB/s 版本：

0x400af8:
lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8

13 GB/s version from g++ / u64 / non-const bufsize :来自g++/u64/非常量 bufsize 的13 GB/s 版本：

0x400c00:
popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize :来自clang++ / u64 / non-const bufsize 的15 GB/s 版本：

0x400e50:
popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50

20 GB/s version from g++ / u32&u64 / const bufsize :来自g++ / u32&u64 / const bufsize 的20 GB/s 版本：

0x400a68:
popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68

15 GB/s version from clang++ / u32&u64 / const bufsize :来自clang++ / u32&u64 / const bufsize 的15 GB/s 版本：

0x400dd0:
popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest!有趣的是，最快（26 GB/s）的版本也是最长的！ It seems to be the only solution that uses lea .它似乎是唯一使用lea解决方案。 Some versions use jb to jump, others use jne .有些版本使用jb跳转，其他版本使用jne 。 But apart from that, all versions seem to be comparable.但除此之外，所有版本似乎都具有可比性。 I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly.我不知道 100% 的性能差距可能来自哪里，但我不太擅长破译程序集。 The slowest (13 GB/s) version looks even very short and good.最慢的 (13 GB/s) 版本看起来甚至非常短而且不错。 Can anyone explain this?谁能解释一下？

Lessons learned得到教训

No matter what the answer to this question will be;不管这个问题的答案是什么； I have learned that in really hot loops every detail can matter, even details that do not seem to have any association to the hot code .我了解到，在真正的热循环中，每个细节都很重要，即使是那些似乎与热代码没有任何关联的细节。 I have never thought about what type to use for a loop variable, but as you see such a minor change can make a 100% difference!我从来没有想过要为循环变量使用什么类型，但是正如您所见，如此微小的更改可以产生100% 的不同！ Even the storage type of a buffer can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable!甚至缓冲区的存储类型也会产生巨大的差异，正如我们在 size 变量前插入static关键字所看到的！ In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.将来，在编写对系统性能至关重要的非常紧且热的循环时，我将始终在各种编译器上测试各种替代方案。

The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times.有趣的是，尽管我已经展开了四次循环，但性能差异仍然如此之大。 So even if you unroll, you can still get hit by major performance deviations.因此，即使您展开，您仍然会受到主要性能偏差的影响。 Quite interesting.挺有意思。

Answer 1

Culprit: False Data Dependency (and the compiler isn't even aware of it)罪魁祸首：错误的数据依赖（编译器甚至不知道）

On Sandy/Ivy Bridge and Haswell processors, the instruction:在 Sandy/Ivy Bridge 和 Haswell 处理器上，指令：

popcnt  src, dest

appears to have a false dependency on the destination register dest .似乎对目标寄存器dest有错误的依赖性。 Even though the instruction only writes to it, the instruction will wait until dest is ready before executing.即使指令只写入它，指令也会等到dest准备好后再执行。 This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake)这种错误的依赖（现在）被英特尔记录为勘误表HSD146 (Haswell)和SKL029 (Skylake)

Skylake fixed this for lzcnt and tzcnt . Skylake 为lzcnt和tzcnt修复了这个tzcnt 。
Cannon Lake (and Ice Lake) fixed this for popcnt . Cannon Lake（和 Ice Lake）为popcnt修复了这个popcnt 。
bsf / bsr have a true output dependency: output unmodified for input=0. bsf / bsr具有真正的输出依赖性：输入=0 时未修改的输出。 (But no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.) （但是没有办法利用内在函数来利用它——只有 AMD 记录它并且编译器不公开它。）

(Yes, these instructions all run on the same execution unit ). （是的，这些指令都在同一个执行单元上运行）。

This dependency doesn't just hold up the 4 popcnt s from a single loop iteration.这种依赖不仅仅支持来自单个循环迭代的 4 个popcnt 。 It can carry across loop iterations making it impossible for the processor to parallelize different loop iterations.它可以进行循环迭代，使处理器无法并行化不同的循环迭代。

The unsigned vs. uint64_t and other tweaks don't directly affect the problem. unsigned vs. uint64_t和其他调整不会直接影响问题。 But they influence the register allocator which assigns the registers to the variables.但是它们会影响将寄存器分配给变量的寄存器分配器。

In your case, the speeds are a direct result of what is stuck to the (false) dependency chain depending on what the register allocator decided to do.在您的情况下，速度是粘在（错误）依赖链上的直接结果，具体取决于寄存器分配器决定做什么。

13 GB/s has a chain: popcnt - add - popcnt - popcnt → next iteration 13 GB/s 有一个链： popcnt - add - popcnt - popcnt → 下一次迭代
15 GB/s has a chain: popcnt - add - popcnt - add → next iteration 15 GB/s 有一个链： popcnt - add - popcnt - add → 下一次迭代
20 GB/s has a chain: popcnt - popcnt → next iteration 20 GB/s 有一个链： popcnt - popcnt → 下一次迭代
26 GB/s has a chain: popcnt - popcnt → next iteration 26 GB/s 有一个链： popcnt - popcnt → 下一次迭代

The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. 20 GB/s 和 26 GB/s 之间的差异似乎是间接寻址的小瑕疵。 Either way, the processor starts to hit other bottlenecks once you reach this speed.无论哪种方式，一旦您达到此速度，处理器就会开始遇到其他瓶颈。

To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want.为了测试这一点，我使用了内联程序集来绕过编译器并获得我想要的程序集。 I also split up the count variable to break all other dependencies that might mess with the benchmarks.我还拆分了count变量以打破所有其他可能与基准测试混乱的依赖关系。

Here are the results:结果如下：

Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom) Sandy Bridge Xeon @ 3.5 GHz：（完整的测试代码可以在底部找到）

GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native GCC 4.6.3： g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
Ubuntu 12 Ubuntu 12

Different Registers: 18.6195 GB/s不同的寄存器： 18.6195 GB/s

.L4:
    movq    (%rbx,%rax,8), %r8
    movq    8(%rbx,%rax,8), %r9
    movq    16(%rbx,%rax,8), %r10
    movq    24(%rbx,%rax,8), %r11
    addq    $4, %rax

    popcnt %r8, %r8
    add    %r8, %rdx
    popcnt %r9, %r9
    add    %r9, %rcx
    popcnt %r10, %r10
    add    %r10, %rdi
    popcnt %r11, %r11
    add    %r11, %rsi

    cmpq    $131072, %rax
    jne .L4

Same Register: 8.49272 GB/s同一寄存器： 8.49272 GB/s

.L9:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # This time reuse "rax" for all the popcnts.
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L9

Same Register with broken chain: 17.8869 GB/s与断链相同的寄存器： 17.8869 GB/s

.L14:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # Reuse "rax" for all the popcnts.
    xor    %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L14

So what went wrong with the compiler?那么编译器出了什么问题呢？

It seems that neither GCC nor Visual Studio are aware that popcnt has such a false dependency.似乎 GCC 和 Visual Studio 都不知道popcnt有这样一个错误的依赖。 Nevertheless, these false dependencies aren't uncommon.然而，这些错误的依赖并不少见。 It's just a matter of whether the compiler is aware of it.这只是编译器是否意识到这一点的问题。

popcnt isn't exactly the most used instruction. popcnt并不是最常用的指令。 So it's not really a surprise that a major compiler could miss something like this.因此，主要编译器可能会错过这样的事情并不奇怪。 There also appears to be no documentation anywhere that mentions this problem.似乎也没有任何地方提到这个问题的文档。 If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.如果英特尔不公开它，那么外界不会知道，直到有人偶然遇到它。

( Update: As of version 4.9.2 , GCC is aware of this false-dependency and generates code to compensate it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.) （更新：从 4.9.2 版本开始，GCC 意识到这种错误依赖，并在启用优化时生成代码来补偿它。来自其他供应商的主要编译器，包括 Clang、MSVC，甚至英特尔自己的 ICC 还没有意识到这个微架构勘误并且不会发出补偿它的代码。）

Why does the CPU have such a false dependency?为什么CPU有这样一个虚假的依赖？

We can speculate: it runs on the same execution unit as bsf / bsr which do have an output dependency.我们可以推测：它与bsf / bsr运行在相同的执行单元上，后者确实具有输出依赖性。 ( How is POPCNT implemented in hardware? ). （ POPCNT 是如何在硬件中实现的？）。 For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified.对于这些指令，英特尔将 input=0 的整数结果记录为“未定义”（ZF=1），但英特尔硬件实际上提供了更强大的保证，以避免破坏旧软件：输出未修改。 AMD documents this behaviour. AMD 记录了这种行为。

Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.据推测，让这个执行单元的一些 uops 依赖于输出而其他的则不方便。

AMD processors do not appear to have this false dependency. AMD 处理器似乎没有这种错误的依赖性。

The full test code is below for reference:完整的测试代码如下供参考：

#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

   using namespace std;
   uint64_t size=1<<20;

   uint64_t* buffer = new uint64_t[size/8];
   char* charbuffer=reinterpret_cast<char*>(buffer);
   for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;

   uint64_t count,duration;
   chrono::time_point<chrono::system_clock> startP,endP;
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %4  \n\t"
                "add %4, %0     \n\t"
                "popcnt %5, %5  \n\t"
                "add %5, %1     \n\t"
                "popcnt %6, %6  \n\t"
                "add %6, %2     \n\t"
                "popcnt %7, %7  \n\t"
                "add %7, %3     \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Chain 4   \t"  << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Broken Chain\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }

   free(charbuffer);
}

An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si一个同样有趣的基准可以在这里找到： http : //pastebin.com/kbzgL8si
This benchmark varies the number of popcnt s that are in the (false) dependency chain.此基准测试会改变（假）依赖链中的popcnt的数量。

False Chain 0:  41959360000 0.57748 sec     18.1578 GB/s
False Chain 1:  41959360000 0.585398 sec    17.9122 GB/s
False Chain 2:  41959360000 0.645483 sec    16.2448 GB/s
False Chain 3:  41959360000 0.929718 sec    11.2784 GB/s
False Chain 4:  41959360000 1.23572 sec     8.48557 GB/s

Answer 2

I coded up an equivalent C program to experiment, and I can confirm this strange behaviour.我编写了一个等效的 C 程序进行实验，我可以确认这种奇怪的行为。 What's more, gcc believes the 64-bit integer (which should probably be a size_t anyway...) to be better, as using uint_fast32_t causes gcc to use a 64-bit uint.更重要的是， gcc认为 64 位整数（无论如何应该是size_t ...）更好，因为使用uint_fast32_t会导致 gcc 使用 64 位 uint。

I did a bit of mucking around with the assembly:我对程序集做了一些处理：
Simply take the 32-bit version, replace all 32-bit instructions/registers with the 64-bit version in the inner popcount-loop of the program.只需采用 32 位版本，在程序的内部 popcount 循环中将所有 32 位指令/寄存器替换为 64 位版本。 Observation: the code is just as fast as the 32-bit version!观察：代码和32位版本一样快！

This is obviously a hack, as the size of the variable isn't really 64 bit, as other parts of the program still use the 32-bit version, but as long as the inner popcount-loop dominates performance, this is a good start.这显然是一个 hack，因为变量的大小并不是真正的 64 位，因为程序的其他部分仍然使用 32 位版本，但只要内部 popcount-loop 主导性能，这是一个好的开始.

I then copied the inner loop code from the 32-bit version of the program, hacked it up to be 64 bit, fiddled with the registers to make it a replacement for the inner loop of the 64-bit version.然后我从程序的 32 位版本中复制了内循环代码，将其修改为 64 位，修改寄存器以使其替代 64 位版本的内循环。 This code also runs as fast as the 32-bit version.此代码的运行速度也与 32 位版本一样快。

My conclusion is that this is bad instruction scheduling by the compiler, not actual speed/latency advantage of 32-bit instructions.我的结论是，这是编译器糟糕的指令调度，而不是 32 位指令的实际速度/延迟优势。

(Caveat: I hacked up assembly, could have broken something without noticing. I don't think so.) （警告：我破坏了程序集，可能在没有注意到的情况下破坏了某些东西。我不这么认为。）

Answer 3

This is not an answer, but it's hard to read if I put results in comment.这不是答案，但如果我将结果放在评论中，则很难阅读。

I get these results with a Mac Pro ( Westmere 6-Cores Xeon 3.33 GHz).我使用Mac Pro （ Westmere 6-Cores Xeon 3.33 GHz）得到了这些结果。 I compiled it with clang -O3 -msse4 -lstdc++ a.cpp -oa (-O2 get same result).我用clang -O3 -msse4 -lstdc++ a.cpp -oa编译它（-O2 得到相同的结果）。

clang with `uint64_t size=atol(argv[1])<<20;` `uint64_t size=atol(argv[1])<<20;`

unsigned    41950110000 0.811198 sec    12.9263 GB/s
uint64_t    41950110000 0.622884 sec    16.8342 GB/s

clang with `uint64_t size=1<<20;` `uint64_t size=1<<20;`

unsigned    41950110000 0.623406 sec    16.8201 GB/s
uint64_t    41950110000 0.623685 sec    16.8126 GB/s

I also tried to:我也尝试过：

Reverse the test order, the result is the same so it rules out the cache factor.颠倒测试顺序，结果相同所以排除了缓存因素。
Have the for statement in reverse: for (uint64_t i=size/8;i>0;i-=4) .使用相反的for语句： for (uint64_t i=size/8;i>0;i-=4) 。 This gives the same result and proves the compile is smart enough to not divide size by 8 every iteration (as expected).这给出了相同的结果并证明编译足够聪明，不会每次迭代都将大小除以 8（如预期的那样）。

Here is my wild guess:这是我的疯狂猜测：

The speed factor comes in three parts:速度因素分为三个部分：

code cache: uint64_t version has larger code size, but this does not have an effect on my Xeon CPU.代码缓存： uint64_t版本有更大的代码大小，但这对我的至强 CPU 没有影响。 This makes the 64-bit version slower.这会使 64 位版本变慢。
Instructions used.使用的说明。 Note not only the loop count, but the buffer is accessed with a 32-bit and 64-bit index on the two versions.不仅要注意循环计数，还要注意在两个版本上使用 32 位和 64 位索引访问缓冲区。 Accessing a pointer with a 64-bit offset requests a dedicated 64-bit register and addressing, while you can use immediate for a 32-bit offset.访问具有 64 位偏移量的指针需要专用的 64 位寄存器和寻址，而您可以对 32 位偏移量使用立即数。 This may make the 32-bit version faster.这可能会使 32 位版本更快。
Instructions are only emitted on the 64-bit compile (that is, prefetch).指令仅在 64 位编译（即预取）上发出。 This makes 64-bit faster.这使得 64 位更快。

The three factors together match with the observed seemingly conflicting results.这三个因素一起与观察到的看似矛盾的结果相匹配。

Answer 4

I can't give an authoritative answer, but provide an overview of a likely cause.我无法给出权威答案，但提供可能原因的概述。 This reference shows pretty clearly that for the instructions in the body of your loop there is a 3:1 ratio between latency and throughput.此参考非常清楚地表明，对于循环主体中的指令，延迟和吞吐量之间的比率为 3:1。 It also shows the effects of multiple dispatch.它还显示了多次分派的效果。 Since there are (give-or-take) three integer units in modern x86 processors, it's generally possible to dispatch three instructions per cycle.由于现代 x86 处理器中有（给予或接受）三个整数单元，因此通常每个周期可以分派三个指令。

So between peak pipeline and multiple dispatch performance and failure of these mechanisms, we have a factor of six in performance.因此，在峰值管道和多分派性能以及这些机制的故障之间，我们的性能有 6 倍。 It's pretty well known that the complexity of the x86 instruction set makes it quite easy for quirky breakage to occur.众所周知，x86 指令集的复杂性使得古怪的破坏很容易发生。 The document above has a great example:上面的文档有一个很好的例子：

The Pentium 4 performance for 64-bit right shifts is really poor. Pentium 4 64 位右移的性能真的很差。 64-bit left shift as well as all 32-bit shifts have acceptable performance. 64 位左移以及所有 32 位移位都具有可接受的性能。 It appears that the data path from the upper 32 bits to the lower 32 bit of the ALU is not well designed.从 ALU 的高 32 位到低 32 位的数据路径似乎没有设计好。

I personally ran into a strange case where a hot loop ran considerably slower on a specific core of a four-core chip (AMD if I recall).我个人遇到了一个奇怪的情况，即热循环在四核芯片（如果我记得是 AMD）的特定内核上运行速度要慢得多。 We actually got better performance on a map-reduce calculation by turning that core off.通过关闭该核心，我们实际上在 map-reduce 计算中获得了更好的性能。

Here my guess is contention for integer units: that the popcnt , loop counter, and address calculations can all just barely run at full speed with the 32-bit wide counter, but the 64-bit counter causes contention and pipeline stalls.这里我的猜测是对整数单元的争用： popcnt 、循环计数器和地址计算都只能用 32 位宽的计数器全速运行，但 64 位计数器会导致争用和管道停顿。 Since there are only about 12 cycles total, potentially 4 cycles with multiple dispatch, per loop body execution, a single stall could reasonably affect run time by a factor of 2.由于每个循环体执行总共只有大约 12 个周期，可能有 4 个周期有多个分派，因此单个停顿可能会合理地将运行时间影响为 2 倍。

The change induced by using a static variable, which I'm guessing just causes a minor reordering of instructions, is another clue that the 32-bit code is at some tipping point for contention.使用静态变量引起的变化（我猜这只会导致指令的轻微重新排序）是 32 位代码处于争用临界点的另一个线索。

I know this is not a rigorous analysis, but it is a plausible explanation.我知道这不是一个严谨的分析，但它是一个似是而非的解释。

Answer 5

I tried this with Visual Studio 2013 Express , using a pointer instead of an index, which sped up the process a bit.我在Visual Studio 2013 Express 中尝试了这个，使用指针而不是索引，这稍微加快了进程。 I suspect this is because the addressing is offset + register, instead of offset + register + (register<<3).我怀疑这是因为寻址是偏移量+寄存器，而不是偏移量+寄存器+（寄存器<<3）。 C++ code. C++ 代码。

   uint64_t* bfrend = buffer+(size/8);
   uint64_t* bfrptr;

// ...

   {
      startP = chrono::system_clock::now();
      count = 0;
      for (unsigned k = 0; k < 10000; k++){
         // Tight unrolled loop with uint64_t
         for (bfrptr = buffer; bfrptr < bfrend;){
            count += __popcnt64(*bfrptr++);
            count += __popcnt64(*bfrptr++);
            count += __popcnt64(*bfrptr++);
            count += __popcnt64(*bfrptr++);
         }
      }
      endP = chrono::system_clock::now();
      duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
           << (10000.0*size)/(duration) << " GB/s" << endl;
   }

assembly code: r10 = bfrptr, r15 = bfrend, rsi = count, rdi = buffer, r13 = k :汇编代码：r10 = bfrptr，r15 = bfrend，rsi = 计数，rdi = 缓冲区，r13 = k：

$LL5@main:
        mov     r10, rdi
        cmp     rdi, r15
        jae     SHORT $LN4@main
        npad    4
$LL2@main:
        mov     rax, QWORD PTR [r10+24]
        mov     rcx, QWORD PTR [r10+16]
        mov     r8, QWORD PTR [r10+8]
        mov     r9, QWORD PTR [r10]
        popcnt  rdx, rax
        popcnt  rax, rcx
        add     rdx, rax
        popcnt  rax, r8
        add     r10, 32
        add     rdx, rax
        popcnt  rax, r9
        add     rsi, rax
        add     rsi, rdx
        cmp     r10, r15
        jb      SHORT $LL2@main
$LN4@main:
        dec     r13
        jne     SHORT $LL5@main

Answer 6

Have you tried passing -funroll-loops -fprefetch-loop-arrays to GCC?您是否尝试过将-funroll-loops -fprefetch-loop-arrays传递给 GCC？

I get the following results with these additional optimizations:通过这些额外的优化，我得到以下结果：

[1829] /tmp/so_25078285 $ cat /proc/cpuinfo |grep CPU|head -n1
model name      : Intel(R) Core(TM) i3-3225 CPU @ 3.30GHz
[1829] /tmp/so_25078285 $ g++ --version|head -n1
g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3

[1829] /tmp/so_25078285 $ g++ -O3 -march=native -std=c++11 test.cpp -o test_o3
[1829] /tmp/so_25078285 $ g++ -O3 -march=native -funroll-loops -fprefetch-loop-arrays -std=c++11     test.cpp -o test_o3_unroll_loops__and__prefetch_loop_arrays

[1829] /tmp/so_25078285 $ ./test_o3 1
unsigned        41959360000     0.595 sec       17.6231 GB/s
uint64_t        41959360000     0.898626 sec    11.6687 GB/s

[1829] /tmp/so_25078285 $ ./test_o3_unroll_loops__and__prefetch_loop_arrays 1
unsigned        41959360000     0.618222 sec    16.9612 GB/s
uint64_t        41959360000     0.407304 sec    25.7443 GB/s

Answer 7

Have you tried moving the reduction step outside the loop?您是否尝试将减少步骤移到循环之外？ Right now you have a data dependency that really isn't needed.现在您有一个真正不需要的数据依赖项。

Try:尝试：

  uint64_t subset_counts[4] = {};
  for( unsigned k = 0; k < 10000; k++){
     // Tight unrolled loop with unsigned
     unsigned i=0;
     while (i < size/8) {
        subset_counts[0] += _mm_popcnt_u64(buffer[i]);
        subset_counts[1] += _mm_popcnt_u64(buffer[i+1]);
        subset_counts[2] += _mm_popcnt_u64(buffer[i+2]);
        subset_counts[3] += _mm_popcnt_u64(buffer[i+3]);
        i += 4;
     }
  }
  count = subset_counts[0] + subset_counts[1] + subset_counts[2] + subset_counts[3];

You also have some weird aliasing going on, that I'm not sure is conformant to the strict aliasing rules.您还有一些奇怪的别名，我不确定是否符合严格的别名规则。

Answer 8

TL;DR: Use __builtin intrinsics instead; TL;DR：改用__builtin内在函数； they might happen to help.他们可能会有所帮助。

I was able to make gcc 4.8.4 (and even 4.7.3 on gcc.godbolt.org) generate optimal code for this by using __builtin_popcountll which uses the same assembly instruction, but gets lucky and happens to make code that doesn't have an unexpectedly long loop-carried dependency because of the false dependency bug.我能够通过使用使用相同汇编指令的__builtin_popcountll使gcc 4.8.4（甚至 gcc.godbolt.org 上的 4.7.3）生成最佳代码，但很幸运，并且碰巧制作了没有的代码由于错误依赖错误，一个意外长的循环携带依赖。

I am not 100% sure of my benchmarking code, but objdump output seems to share my views.我不是 100% 确定我的基准测试代码，但objdump输出似乎同意我的观点。 I use some other tricks ( ++i vs i++ ) to make the compiler unroll loop for me without any movl instruction (strange behaviour, I must say).我使用其他一些技巧（ ++i vs i++ ）让编译器在没有任何movl指令的情况下为我展开循环（奇怪的行为，我必须说）。

Results:结果：

Count: 20318230000  Elapsed: 0.411156 seconds   Speed: 25.503118 GB/s

Benchmarking code:基准代码：

#include <stdint.h>
#include <stddef.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

uint64_t builtin_popcnt(const uint64_t* buf, size_t len){
  uint64_t cnt = 0;
  for(size_t i = 0; i < len; ++i){
    cnt += __builtin_popcountll(buf[i]);
  }
  return cnt;
}

int main(int argc, char** argv){
  if(argc != 2){
    printf("Usage: %s <buffer size in MB>\n", argv[0]);
    return -1;
  }
  uint64_t size = atol(argv[1]) << 20;
  uint64_t* buffer = (uint64_t*)malloc((size/8)*sizeof(*buffer));

  // Spoil copy-on-write memory allocation on *nix
  for (size_t i = 0; i < (size / 8); i++) {
    buffer[i] = random();
  }
  uint64_t count = 0;
  clock_t tic = clock();
  for(size_t i = 0; i < 10000; ++i){
    count += builtin_popcnt(buffer, size/8);
  }
  clock_t toc = clock();
  printf("Count: %lu\tElapsed: %f seconds\tSpeed: %f GB/s\n", count, (double)(toc - tic) / CLOCKS_PER_SEC, ((10000.0*size)/(((double)(toc - tic)*1e+9) / CLOCKS_PER_SEC)));
  return 0;
}

Compile options:编译选项：

gcc --std=gnu99 -mpopcnt -O3 -funroll-loops -march=native bench.c -o bench

GCC version:海湾合作委员会版本：

gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4

Linux kernel version: Linux内核版本：

3.19.0-58-generic

CPU information: CPU信息：

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 70
model name  : Intel(R) Core(TM) i7-4870HQ CPU @ 2.50 GHz
stepping    : 1
microcode   : 0xf
cpu MHz     : 2494.226
cache size  : 6144 KB
physical id : 0
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm arat pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid xsaveopt
bugs        :
bogomips    : 4988.45
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

Answer 9

This is not an answer but a feedback with few compilers of 2021. On Intel CoffeeLake 9900k.这不是答案，而是对 2021 年少数编译器的反馈。在英特尔 CoffeeLake 9900k 上。

With Microsoft compiler (VS2019), toolset v142:使用 Microsoft 编译器 (VS2019)，工具集 v142：

 unsigned 209695540000 1.8322 sec 28.6152 GB/s uint64_t 209695540000 3.08764 sec 16.9802 GB/s

With Intel compiler 2021:使用英特尔编译器 2021：

 unsigned 209695540000 1.70845 sec 30.688 GB/s uint64_t 209695540000 1.57956 sec 33.1921 GB/s

According to Mysticial's answer, Intel compiler is aware of False Data Dependency, but not Microsoft compiler.根据 Mysticial 的回答，Intel 编译器知道 False Data Dependency，但不知道 Microsoft 编译器。

For intel compiler, I used /QxHost (optimize of CPU's architecture which is that of the host) /Oi (enable intrinsic functions) and #include <nmmintrin.h> instead of #include <immintrin.h> .对于英特尔编译器，我使用了/QxHost （优化了主机的 CPU 架构） /Oi （启用内部函数）和#include <nmmintrin.h>而不是#include <immintrin.h> 。

Full compile command: /GS /W3 /QxHost /Gy /Zi /O2 /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /Oi /MD /Fa"x64\\Release\\" /EHsc /nologo /Fo"x64\\Release\\" //fprofile-instr-use "x64\\Release\\" /Fp"x64\\Release\\Benchmark.pch" .完整编译命令： /GS /W3 /QxHost /Gy /Zi /O2 /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /Oi /MD /Fa"x64\\Release\\" /EHsc /nologo /Fo"x64\\Release\\" //fprofile-instr-use "x64\\Release\\" /Fp"x64\\Release\\Benchmark.pch" 。

The decompiled (by IDA 7.5) assembly from ICC:来自 ICC 的反编译（由 IDA 7.5）程序集：

int __cdecl main(int argc, const char **argv, const char **envp)
{
  int v6; // er13
  _BYTE *v8; // rsi
  unsigned int v9; // edi
  unsigned __int64 i; // rbx
  unsigned __int64 v11; // rdi
  int v12; // ebp
  __int64 v13; // r14
  __int64 v14; // rbx
  unsigned int v15; // eax
  unsigned __int64 v16; // rcx
  unsigned int v17; // eax
  unsigned __int64 v18; // rcx
  __int64 v19; // rdx
  unsigned int v20; // eax
  int result; // eax
  std::ostream *v23; // rbx
  char v24; // dl
  std::ostream *v33; // rbx
  std::ostream *v41; // rbx
  __int64 v42; // rdx
  unsigned int v43; // eax
  int v44; // ebp
  __int64 v45; // r14
  __int64 v46; // rbx
  unsigned __int64 v47; // rax
  unsigned __int64 v48; // rax
  std::ostream *v50; // rdi
  char v51; // dl
  std::ostream *v58; // rdi
  std::ostream *v60; // rdi
  __int64 v61; // rdx
  unsigned int v62; // eax

  __asm
  {
    vmovdqa [rsp+98h+var_58], xmm8
    vmovapd [rsp+98h+var_68], xmm7
    vmovapd [rsp+98h+var_78], xmm6
  }
  if ( argc == 2 )
  {
    v6 = atol(argv[1]) << 20;
    _R15 = v6;
    v8 = operator new[](v6);
    if ( v6 )
    {
      v9 = 1;
      for ( i = 0i64; i < v6; i = v9++ )
        v8[i] = rand();
    }
    v11 = (unsigned __int64)v6 >> 3;
    v12 = 0;
    v13 = Xtime_get_ticks_0();
    v14 = 0i64;
    do
    {
      if ( v6 )
      {
        v15 = 4;
        v16 = 0i64;
        do
        {
          v14 += __popcnt(*(_QWORD *)&v8[8 * v16])
               + __popcnt(*(_QWORD *)&v8[8 * v15 - 24])
               + __popcnt(*(_QWORD *)&v8[8 * v15 - 16])
               + __popcnt(*(_QWORD *)&v8[8 * v15 - 8]);
          v16 = v15;
          v15 += 4;
        }
        while ( v11 > v16 );
        v17 = 4;
        v18 = 0i64;
        do
        {
          v14 += __popcnt(*(_QWORD *)&v8[8 * v18])
               + __popcnt(*(_QWORD *)&v8[8 * v17 - 24])
               + __popcnt(*(_QWORD *)&v8[8 * v17 - 16])
               + __popcnt(*(_QWORD *)&v8[8 * v17 - 8]);
          v18 = v17;
          v17 += 4;
        }
        while ( v11 > v18 );
      }
      v12 += 2;
    }
    while ( v12 != 10000 );
    _RBP = 100 * (Xtime_get_ticks_0() - v13);
    std::operator___std::char_traits_char___(std::cout, "unsigned\t");
    v23 = (std::ostream *)std::ostream::operator<<(std::cout, v14);
    std::operator___std::char_traits_char____0(v23, v24);
    __asm
    {
      vmovq   xmm0, rbp
      vmovdqa xmm8, cs:__xmm@00000000000000004530000043300000
      vpunpckldq xmm0, xmm0, xmm8
      vmovapd xmm7, cs:__xmm@45300000000000004330000000000000
      vsubpd  xmm0, xmm0, xmm7
      vpermilpd xmm1, xmm0, 1
      vaddsd  xmm6, xmm1, xmm0
      vdivsd  xmm1, xmm6, cs:__real@41cdcd6500000000
    }
    v33 = (std::ostream *)std::ostream::operator<<(v23);
    std::operator___std::char_traits_char___(v33, " sec \t");
    __asm
    {
      vmovq   xmm0, r15
      vpunpckldq xmm0, xmm0, xmm8
      vsubpd  xmm0, xmm0, xmm7
      vpermilpd xmm1, xmm0, 1
      vaddsd  xmm0, xmm1, xmm0
      vmulsd  xmm7, xmm0, cs:__real@40c3880000000000
      vdivsd  xmm1, xmm7, xmm6
    }
    v41 = (std::ostream *)std::ostream::operator<<(v33);
    std::operator___std::char_traits_char___(v41, " GB/s");
    LOBYTE(v42) = 10;
    v43 = std::ios::widen((char *)v41 + *(int *)(*(_QWORD *)v41 + 4i64), v42);
    std::ostream::put(v41, v43);
    std::ostream::flush(v41);
    v44 = 0;
    v45 = Xtime_get_ticks_0();
    v46 = 0i64;
    do
    {
      if ( v6 )
      {
        v47 = 0i64;
        do
        {
          v46 += __popcnt(*(_QWORD *)&v8[8 * v47])
               + __popcnt(*(_QWORD *)&v8[8 * v47 + 8])
               + __popcnt(*(_QWORD *)&v8[8 * v47 + 16])
               + __popcnt(*(_QWORD *)&v8[8 * v47 + 24]);
          v47 += 4i64;
        }
        while ( v47 < v11 );
        v48 = 0i64;
        do
        {
          v46 += __popcnt(*(_QWORD *)&v8[8 * v48])
               + __popcnt(*(_QWORD *)&v8[8 * v48 + 8])
               + __popcnt(*(_QWORD *)&v8[8 * v48 + 16])
               + __popcnt(*(_QWORD *)&v8[8 * v48 + 24]);
          v48 += 4i64;
        }
        while ( v48 < v11 );
      }
      v44 += 2;
    }
    while ( v44 != 10000 );
    _RBP = 100 * (Xtime_get_ticks_0() - v45);
    std::operator___std::char_traits_char___(std::cout, "uint64_t\t");
    v50 = (std::ostream *)std::ostream::operator<<(std::cout, v46);
    std::operator___std::char_traits_char____0(v50, v51);
    __asm
    {
      vmovq   xmm0, rbp
      vpunpckldq xmm0, xmm0, cs:__xmm@00000000000000004530000043300000
      vsubpd  xmm0, xmm0, cs:__xmm@45300000000000004330000000000000
      vpermilpd xmm1, xmm0, 1
      vaddsd  xmm6, xmm1, xmm0
      vdivsd  xmm1, xmm6, cs:__real@41cdcd6500000000
    }
    v58 = (std::ostream *)std::ostream::operator<<(v50);
    std::operator___std::char_traits_char___(v58, " sec \t");
    __asm { vdivsd  xmm1, xmm7, xmm6 }
    v60 = (std::ostream *)std::ostream::operator<<(v58);
    std::operator___std::char_traits_char___(v60, " GB/s");
    LOBYTE(v61) = 10;
    v62 = std::ios::widen((char *)v60 + *(int *)(*(_QWORD *)v60 + 4i64), v61);
    std::ostream::put(v60, v62);
    std::ostream::flush(v60);
    free(v8);
    result = 0;
  }
  else
  {
    std::operator___std::char_traits_char___(std::cerr, "usage: array_size in MB");
    LOBYTE(v19) = 10;
    v20 = std::ios::widen((char *)&std::cerr + *((int *)std::cerr + 1), v19);
    std::ostream::put(std::cerr, v20);
    std::ostream::flush(std::cerr);
    result = -1;
  }
  __asm
  {
    vmovaps xmm6, [rsp+98h+var_78]
    vmovaps xmm7, [rsp+98h+var_68]
    vmovaps xmm8, [rsp+98h+var_58]
  }
  return result;
}

and disassembly of main:和主要的拆卸：

.text:0140001000    .686p
.text:0140001000    .mmx
.text:0140001000    .model flat
.text:0140001000
.text:0140001000 ; ===========================================================================
.text:0140001000
.text:0140001000 ; Segment type: Pure code
.text:0140001000 ; Segment permissions: Read/Execute
.text:0140001000 _text           segment para public 'CODE' use64
.text:0140001000    assume cs:_text
.text:0140001000    ;org 140001000h
.text:0140001000    assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing
.text:0140001000
.text:0140001000 ; =============== S U B R O U T I N E =======================================
.text:0140001000
.text:0140001000
.text:0140001000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:0140001000 main            proc near      ; CODE XREF: __scrt_common_main_seh+107↓p
.text:0140001000      ; DATA XREF: .pdata:ExceptionDir↓o
.text:0140001000
.text:0140001000 var_78          = xmmword ptr -78h
.text:0140001000 var_68          = xmmword ptr -68h
.text:0140001000 var_58          = xmmword ptr -58h
.text:0140001000
.text:0140001000    push    r15
.text:0140001002    push    r14
.text:0140001004    push    r13
.text:0140001006    push    r12
.text:0140001008    push    rsi
.text:0140001009    push    rdi
.text:014000100A    push    rbp
.text:014000100B    push    rbx
.text:014000100C    sub     rsp, 58h
.text:0140001010    vmovdqa [rsp+98h+var_58], xmm8
.text:0140001016    vmovapd [rsp+98h+var_68], xmm7
.text:014000101C    vmovapd [rsp+98h+var_78], xmm6
.text:0140001022    cmp     ecx, 2
.text:0140001025    jnz     loc_14000113E
.text:014000102B    mov     rcx, [rdx+8]    ; String
.text:014000102F    call    cs:__imp_atol
.text:0140001035    mov     r13d, eax
.text:0140001038    shl     r13d, 14h
.text:014000103C    movsxd  r15, r13d
.text:014000103F    mov     rcx, r15        ; size
.text:0140001042    call    ??_U@YAPEAX_K@Z ; operator new[](unsigned __int64)
.text:0140001047    mov     rsi, rax
.text:014000104A    test    r15d, r15d
.text:014000104D    jz      short loc_14000106E
.text:014000104F    mov     edi, 1
.text:0140001054    xor     ebx, ebx
.text:0140001056    mov     rbp, cs:__imp_rand
.text:014000105D    nop     dword ptr [rax]
.text:0140001060
.text:0140001060 loc_140001060:    ; CODE XREF: main+6C↓j
.text:0140001060    call    rbp ; __imp_rand
.text:0140001062    mov     [rsi+rbx], al
.text:0140001065    mov     ebx, edi
.text:0140001067    inc     edi
.text:0140001069    cmp     rbx, r15
.text:014000106C    jb      short loc_140001060
.text:014000106E
.text:014000106E loc_14000106E:    ; CODE XREF: main+4D↑j
.text:014000106E    mov     rdi, r15
.text:0140001071    shr     rdi, 3
.text:0140001075    xor     ebp, ebp
.text:0140001077    call    _Xtime_get_ticks_0
.text:014000107C    mov     r14, rax
.text:014000107F    xor     ebx, ebx
.text:0140001081    jmp     short loc_14000109F
.text:0140001081 ; ---------------------------------------------------------------------------
.text:0140001083    align 10h
.text:0140001090
.text:0140001090 loc_140001090:    ; CODE XREF: main+A2↓j
.text:0140001090      ; main+EC↓j ...
.text:0140001090    add     ebp, 2
.text:0140001093    cmp     ebp, 2710h
.text:0140001099    jz      loc_140001184
.text:014000109F
.text:014000109F loc_14000109F:    ; CODE XREF: main+81↑j
.text:014000109F    test    r13d, r13d
.text:01400010A2    jz      short loc_140001090
.text:01400010A4    mov     eax, 4
.text:01400010A9    xor     ecx, ecx
.text:01400010AB    nop     dword ptr [rax+rax+00h]
.text:01400010B0
.text:01400010B0 loc_1400010B0:    ; CODE XREF: main+E7↓j
.text:01400010B0    popcnt  rcx, qword ptr [rsi+rcx*8]
.text:01400010B6    add     rcx, rbx
.text:01400010B9    lea     edx, [rax-3]
.text:01400010BC    popcnt  rdx, qword ptr [rsi+rdx*8]
.text:01400010C2    add     rdx, rcx
.text:01400010C5    lea     ecx, [rax-2]
.text:01400010C8    popcnt  rcx, qword ptr [rsi+rcx*8]
.text:01400010CE    add     rcx, rdx
.text:01400010D1    lea     edx, [rax-1]
.text:01400010D4    xor     ebx, ebx
.text:01400010D6    popcnt  rbx, qword ptr [rsi+rdx*8]
.text:01400010DC    add     rbx, rcx
.text:01400010DF    mov     ecx, eax
.text:01400010E1    add     eax, 4
.text:01400010E4    cmp     rdi, rcx
.text:01400010E7    ja      short loc_1400010B0
.text:01400010E9    test    r13d, r13d
.text:01400010EC    jz      short loc_140001090
.text:01400010EE    mov     eax, 4
.text:01400010F3    xor     ecx, ecx
.text:01400010F5    db      2Eh
.text:01400010F5    nop     word ptr [rax+rax+00000000h]
.text:01400010FF    nop
.text:0140001100
.text:0140001100 loc_140001100:    ; CODE XREF: main+137↓j
.text:0140001100    popcnt  rcx, qword ptr [rsi+rcx*8]
.text:0140001106    add     rcx, rbx
.text:0140001109    lea     edx, [rax-3]
.text:014000110C    popcnt  rdx, qword ptr [rsi+rdx*8]
.text:0140001112    add     rdx, rcx
.text:0140001115    lea     ecx, [rax-2]
.text:0140001118    popcnt  rcx, qword ptr [rsi+rcx*8]
.text:014000111E    add     rcx, rdx
.text:0140001121    lea     edx, [rax-1]
.text:0140001124    xor     ebx, ebx
.text:0140001126    popcnt  rbx, qword ptr [rsi+rdx*8]
.text:014000112C    add     rbx, rcx
.text:014000112F    mov     ecx, eax
.text:0140001131    add     eax, 4
.text:0140001134    cmp     rdi, rcx
.text:0140001137    ja      short loc_140001100
.text:0140001139    jmp     loc_140001090
.text:014000113E ; ---------------------------------------------------------------------------
.text:014000113E
.text:014000113E loc_14000113E:    ; CODE XREF: main+25↑j
.text:014000113E    mov     rsi, cs:__imp_?cerr@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::ostream std::cerr
.text:0140001145    lea     rdx, aUsageArraySize ; "usage: array_size in MB"
.text:014000114C    mov     rcx, rsi        ; std::ostream *
.text:014000114F    call    std__operator___std__char_traits_char___
.text:0140001154    mov     rax, [rsi]
.text:0140001157    movsxd  rcx, dword ptr [rax+4]
.text:014000115B    add     rcx, rsi
.text:014000115E    mov     dl, 0Ah
.text:0140001160    call    cs:__imp_?widen@?$basic_ios@DU?$char_traits@D@std@@@std@@QEBADD@Z ; std::ios::widen(char)
.text:0140001166    mov     rcx, rsi
.text:0140001169    mov     edx, eax
.text:014000116B    call    cs:__imp_?put@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@D@Z ; std::ostream::put(char)
.text:0140001171    mov     rcx, rsi
.text:0140001174    call    cs:__imp_?flush@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@XZ ; std::ostream::flush(void)
.text:014000117A    mov     eax, 0FFFFFFFFh
.text:014000117F    jmp     loc_1400013E2
.text:0140001184 ; ---------------------------------------------------------------------------
.text:0140001184
.text:0140001184 loc_140001184:    ; CODE XREF: main+99↑j
.text:0140001184    call    _Xtime_get_ticks_0
.text:0140001189    sub     rax, r14
.text:014000118C    imul    rbp, rax, 64h ; 'd'
.text:0140001190    mov     r14, cs:__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::ostream std::cout
.text:0140001197    lea     rdx, aUnsigned  ; "unsigned\t"
.text:014000119E    mov     rcx, r14        ; std::ostream *
.text:01400011A1    call    std__operator___std__char_traits_char___
.text:01400011A6    mov     rcx, r14
.text:01400011A9    mov     rdx, rbx
.text:01400011AC    call    cs:__imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV01@_K@Z ; std::ostream::operator<<(unsigned __int64)
.text:01400011B2    mov     rbx, rax
.text:01400011B5    mov     rcx, rax        ; std::ostream *
.text:01400011B8    call    std__operator___std__char_traits_char____0
.text:01400011BD    vmovq   xmm0, rbp
.text:01400011C2    vmovdqa xmm8, cs:__xmm@00000000000000004530000043300000
.text:01400011CA    vpunpckldq xmm0, xmm0, xmm8
.text:01400011CF    vmovapd xmm7, cs:__xmm@45300000000000004330000000000000
.text:01400011D7    vsubpd  xmm0, xmm0, xmm7
.text:01400011DB    vpermilpd xmm1, xmm0, 1
.text:01400011E1    vaddsd  xmm6, xmm1, xmm0
.text:01400011E5    vdivsd  xmm1, xmm6, cs:__real@41cdcd6500000000
.text:01400011ED    mov     r12, cs:__imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV01@N@Z ; std::ostream::operator<<(double)
.text:01400011F4    mov     rcx, rbx
.text:01400011F7    call    r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
.text:01400011FA    mov     rbx, rax
.text:01400011FD    lea     rdx, aSec       ; " sec \t"
.text:0140001204    mov     rcx, rax        ; std::ostream *
.text:0140001207    call    std__operator___std__char_traits_char___
.text:014000120C    vmovq   xmm0, r15
.text:0140001211    vpunpckldq xmm0, xmm0, xmm8
.text:0140001216    vsubpd  xmm0, xmm0, xmm7
.text:014000121A    vpermilpd xmm1, xmm0, 1
.text:0140001220    vaddsd  xmm0, xmm1, xmm0
.text:0140001224    vmulsd  xmm7, xmm0, cs:__real@40c3880000000000
.text:014000122C    vdivsd  xmm1, xmm7, xmm6
.text:0140001230    mov     rcx, rbx
.text:0140001233    call    r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
.text:0140001236    mov     rbx, rax
.text:0140001239    lea     rdx, aGbS       ; " GB/s"
.text:0140001240    mov     rcx, rax        ; std::ostream *
.text:0140001243    call    std__operator___std__char_traits_char___
.text:0140001248    mov     rax, [rbx]
.text:014000124B    movsxd  rcx, dword ptr [rax+4]
.text:014000124F    add     rcx, rbx
.text:0140001252    mov     dl, 0Ah
.text:0140001254    call    cs:__imp_?widen@?$basic_ios@DU?$char_traits@D@std@@@std@@QEBADD@Z ; std::ios::widen(char)
.text:014000125A    mov     rcx, rbx
.text:014000125D    mov     edx, eax
.text:014000125F    call    cs:__imp_?put@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@D@Z ; std::ostream::put(char)
.text:0140001265    mov     rcx, rbx
.text:0140001268    call    cs:__imp_?flush@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@XZ ; std::ostream::flush(void)
.text:014000126E    xor     ebp, ebp
.text:0140001270    call    _Xtime_get_ticks_0
.text:0140001275    mov     r14, rax
.text:0140001278    xor     ebx, ebx
.text:014000127A    jmp     short loc_14000128F
.text:014000127A ; ---------------------------------------------------------------------------
.text:014000127C    align 20h
.text:0140001280
.text:0140001280 loc_140001280:    ; CODE XREF: main+292↓j
.text:0140001280      ; main+2DB↓j ...
.text:0140001280    add     ebp, 2
.text:0140001283    cmp     ebp, 2710h
.text:0140001289    jz      loc_14000131D
.text:014000128F
.text:014000128F loc_14000128F:    ; CODE XREF: main+27A↑j
.text:014000128F    test    r13d, r13d
.text:0140001292    jz      short loc_140001280
.text:0140001294    xor     eax, eax
.text:0140001296    db      2Eh
.text:0140001296    nop     word ptr [rax+rax+00000000h]
.text:01400012A0
.text:01400012A0 loc_1400012A0:    ; CODE XREF: main+2D6↓j
.text:01400012A0    xor     ecx, ecx
.text:01400012A2    popcnt  rcx, qword ptr [rsi+rax*8]
.text:01400012A8    add     rcx, rbx
.text:01400012AB    xor     edx, edx
.text:01400012AD    popcnt  rdx, qword ptr [rsi+rax*8+8]
.text:01400012B4    add     rdx, rcx
.text:01400012B7    xor     ecx, ecx
.text:01400012B9    popcnt  rcx, qword ptr [rsi+rax*8+10h]
.text:01400012C0    add     rcx, rdx
.text:01400012C3    xor     ebx, ebx
.text:01400012C5    popcnt  rbx, qword ptr [rsi+rax*8+18h]
.text:01400012CC    add     rbx, rcx
.text:01400012CF    add     rax, 4
.text:01400012D3    cmp     rax, rdi
.text:01400012D6    jb      short loc_1400012A0
.text:01400012D8    test    r13d, r13d
.text:01400012DB    jz      short loc_140001280
.text:01400012DD    xor     eax, eax
.text:01400012DF    nop
.text:01400012E0
.text:01400012E0 loc_1400012E0:    ; CODE XREF: main+316↓j
.text:01400012E0    xor     ecx, ecx
.text:01400012E2    popcnt  rcx, qword ptr [rsi+rax*8]
.text:01400012E8    add     rcx, rbx
.text:01400012EB    xor     edx, edx
.text:01400012ED    popcnt  rdx, qword ptr [rsi+rax*8+8]
.text:01400012F4    add     rdx, rcx
.text:01400012F7    xor     ecx, ecx
.text:01400012F9    popcnt  rcx, qword ptr [rsi+rax*8+10h]
.text:0140001300    add     rcx, rdx
.text:0140001303    xor     ebx, ebx
.text:0140001305    popcnt  rbx, qword ptr [rsi+rax*8+18h]
.text:014000130C    add     rbx, rcx
.text:014000130F    add     rax, 4
.text:0140001313    cmp     rax, rdi
.text:0140001316    jb      short loc_1400012E0
.text:0140001318    jmp     loc_140001280
.text:014000131D ; ---------------------------------------------------------------------------
.text:014000131D
.text:014000131D loc_14000131D:    ; CODE XREF: main+289↑j
.text:014000131D    call    _Xtime_get_ticks_0
.text:0140001322    sub     rax, r14
.text:0140001325    imul    rbp, rax, 64h ; 'd'
.text:0140001329    mov     rdi, cs:__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::ostream std::cout
.text:0140001330    lea     rdx, aUint64T   ; "uint64_t\t"
.text:0140001337    mov     rcx, rdi        ; std::ostream *
.text:014000133A    call    std__operator___std__char_traits_char___
.text:014000133F    mov     rcx, rdi
.text:0140001342    mov     rdx, rbx
.text:0140001345    call    cs:__imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV01@_K@Z ; std::ostream::operator<<(unsigned __int64)
.text:014000134B    mov     rdi, rax
.text:014000134E    mov     rcx, rax        ; std::ostream *
.text:0140001351    call    std__operator___std__char_traits_char____0
.text:0140001356    vmovq   xmm0, rbp
.text:014000135B    vpunpckldq xmm0, xmm0, cs:__xmm@00000000000000004530000043300000
.text:0140001363    vsubpd  xmm0, xmm0, cs:__xmm@45300000000000004330000000000000
.text:014000136B    vpermilpd xmm1, xmm0, 1
.text:0140001371    vaddsd  xmm6, xmm1, xmm0
.text:0140001375    vdivsd  xmm1, xmm6, cs:__real@41cdcd6500000000
.text:014000137D    mov     rcx, rdi
.text:0140001380    call    r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
.text:0140001383    mov     rdi, rax
.text:0140001386    lea     rdx, aSec       ; " sec \t"
.text:014000138D    mov     rcx, rax        ; std::ostream *
.text:0140001390    call    std__operator___std__char_traits_char___
.text:0140001395    vdivsd  xmm1, xmm7, xmm6
.text:0140001399    mov     rcx, rdi
.text:014000139C    call    r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
.text:014000139F    mov     rdi, rax
.text:01400013A2    lea     rdx, aGbS       ; " GB/s"
.text:01400013A9    mov     rcx, rax        ; std::ostream *
.text:01400013AC    call    std__operator___std__char_traits_char___
.text:01400013B1    mov     rax, [rdi]
.text:01400013B4    movsxd  rcx, dword ptr [rax+4]
.text:01400013B8    add     rcx, rdi
.text:01400013BB    mov     dl, 0Ah
.text:01400013BD    call    cs:__imp_?widen@?$basic_ios@DU?$char_traits@D@std@@@std@@QEBADD@Z ; std::ios::widen(char)
.text:01400013C3    mov     rcx, rdi
.text:01400013C6    mov     edx, eax
.text:01400013C8    call    cs:__imp_?put@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@D@Z ; std::ostream::put(char)
.text:01400013CE    mov     rcx, rdi
.text:01400013D1    call    cs:__imp_?flush@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@XZ ; std::ostream::flush(void)
.text:01400013D7    mov     rcx, rsi        ; Block
.text:01400013DA    call    cs:__imp_free
.text:01400013E0    xor     eax, eax
.text:01400013E2
.text:01400013E2 loc_1400013E2:    ; CODE XREF: main+17F↑j
.text:01400013E2    vmovaps xmm6, [rsp+98h+var_78]
.text:01400013E8    vmovaps xmm7, [rsp+98h+var_68]
.text:01400013EE    vmovaps xmm8, [rsp+98h+var_58]
.text:01400013F4    add     rsp, 58h
.text:01400013F8    pop     rbx
.text:01400013F9    pop     rbp
.text:01400013FA    pop     rdi
.text:01400013FB    pop     rsi
.text:01400013FC    pop     r12
.text:01400013FE    pop     r13
.text:0140001400    pop     r14
.text:0140001402    pop     r15
.text:0140001404    retn
.text:0140001404 main            endp

Coffee lake specification update "POPCNT instruction may take longer to execute than expected". Coffee Lake 规范更新“POPCNT 指令的执行时间可能比预期的要长”。

Answer 10

First of all, try to estimate peak performance - examine https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf , in particular, Appendix C.首先，尝试估计峰值性能 - 检查https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf ，特别是附录 C。

In your case, it's table C-10 that shows POPCNT instruction has latency = 3 clocks and throughput = 1 clock.在您的情况下，表 C-10 显示 POPCNT 指令的延迟 = 3 个时钟，吞吐量 = 1 个时钟。 Throughput shows your maximal rate in clocks (multiply by core frequency and 8 bytes in case of popcnt64 to get your best possible bandwidth number).吞吐量以时钟为单位显示您的最大速率（乘以核心频率和 8 个字节，在 popcnt64 的情况下，以获得最佳可能的带宽数）。

Now examine what compiler did and sum up throughputs of all other instructions in the loop.现在检查编译器做了什么并总结循环中所有其他指令的吞吐量。 This will give best possible estimate for generated code.这将为生成的代码提供最好的估计。

At last, look at data dependencies between instructions in the loop as they will force latency-large delay instead of throughput - so split instructions of single iteration on data flow chains and calculate latency across them then naively pick up maximal from them.最后，查看循环中指令之间的数据依赖性，因为它们将强制延迟大延迟而不是吞吐量 - 因此在数据流链上拆分单次迭代的指令并计算它们之间的延迟，然后天真地从中获取最大值。 it will give rough estimate taking into account data flow dependencies.考虑到数据流的依赖性，它将给出粗略的估计。

However, in your case, just writing code the right way would eliminate all these complexities.但是，就您而言，只需以正确的方式编写代码就可以消除所有这些复杂性。 Instead of accumulating to the same count variable, just accumulate to different ones (like count0, count1, ... count8) and sum them up at the end.不要累加到相同的计数变量，只需累加到不同的变量（如count0、count1、...count8）并在最后将它们相加。 Or even create an array of counts[8] and accumulate to its elements - perhaps, it will be vectorized even and you will get much better throughput.或者甚至创建一个计数数组 [8] 并累积到其元素 - 也许，它甚至会被矢量化，并且您将获得更好的吞吐量。

PS and never run benchmark for a second, first warm up the core then run loop for at least 10 seconds or better 100 seconds. PS，永远不要运行基准测试一秒钟，首先预热核心然后运行循环至少 10 秒或更好的 100 秒。 otherwise, you will test power management firmware and DVFS implementation in hardware :)否则，您将在硬件中测试电源管理固件和 DVFS 实现:)

PPS I heard endless debates on how much time should benchmark really run. PPS 我听到了关于基准测试应该运行多少时间的无休止的争论。 Most smartest folks are even asking why 10 seconds not 11 or 12. I should admit this is funny in theory.大多数最聪明的人甚至会问为什么 10 秒而不是 11 或 12 秒。我应该承认这在理论上很有趣。 In practice, you just go and run benchmark hundred times in a row and record deviations.实际上，您只需连续运行一百次基准测试并记录偏差即可。 That IS funny.这是有趣的。 Most people do change source and run bench after that exactly ONCE to capture new performance record.大多数人确实会更改源代码并在此之后运行 bench 以捕获新的性能记录。 Do the right things right.正确地做正确的事。

Not convinced still?还不服气？ Just use above C-version of benchmark by assp1r1n3 ( https://stackoverflow.com/a/37026212/9706746 ) and try 100 instead of 10000 in retry loop.只需通过 assp1r1n3 ( https://stackoverflow.com/a/37026212/9706746 ) 使用上述 C 版本的基准测试，并在重试循环中尝试 100 而不是 10000。

My 7960X shows, with RETRY=100:我的 7960X 显示，RETRY=100：

Count: 203182300 Elapsed: 0.008385 seconds Speed: 12.505379 GB/s计数：203182300 已用时间：0.008385 秒速度：12.505379 GB/s

Count: 203182300 Elapsed: 0.011063 seconds Speed: 9.478225 GB/s计数：203182300 已用时间：0.011063 秒速度：9.478225 GB/s

Count: 203182300 Elapsed: 0.011188 seconds Speed: 9.372327 GB/s计数：203182300 已用时间：0.011188 秒速度：9.372327 GB/s

Count: 203182300 Elapsed: 0.010393 seconds Speed: 10.089252 GB/s计数：203182300 已用时间：0.010393 秒速度：10.089252 GB/s

Count: 203182300 Elapsed: 0.009076 seconds Speed: 11.553283 GB/s计数：203182300 已用时间：0.009076 秒速度：11.553283 GB/s

with RETRY=10000:重试=10000：

Count: 20318230000 Elapsed: 0.661791 seconds Speed: 15.844519 GB/s计数：20318230000 已用时间：0.661791 秒速度：15.844519 GB/s

Count: 20318230000 Elapsed: 0.665422 seconds Speed: 15.758060 GB/s计数：20318230000 已用时间：0.665422 秒速度：15.758060 GB/s

Count: 20318230000 Elapsed: 0.660983 seconds Speed: 15.863888 GB/s计数：20318230000 已用时间：0.660983 秒速度：15.863888 GB/s

Count: 20318230000 Elapsed: 0.665337 seconds Speed: 15.760073 GB/s计数：20318230000 已用时间：0.665337 秒速度：15.760073 GB/s

Count: 20318230000 Elapsed: 0.662138 seconds Speed: 15.836215 GB/s计数：20318230000 已用时间：0.662138 秒速度：15.836215 GB/s

PPPS Finally, on "accepted answer" and other mistery ;-) PPPS 最后，关于“接受的答案”和其他谜团 ;-)

Let's use assp1r1n3's answer - he has 2.5Ghz core.让我们使用 assp1r1n3 的答案 - 他有 2.5Ghz 核心。 POPCNT has 1 clock throuhgput, his code is using 64-bit popcnt. POPCNT 有 1 个时钟吞吐量，他的代码使用 64 位 popcnt。 So math is 2.5Ghz * 1 clock * 8 bytes = 20 GB/s for his setup.所以他的设置的数学是 2.5Ghz * 1 个时钟 * 8 个字节 = 20 GB/s。 He is seeing 25Gb/s, perhaps due to turbo boost to around 3Ghz.他看到了 25Gb/s，这可能是由于涡轮增压到 3Ghz 左右。

Thus go to ark.intel.com and look for i7-4870HQ: https://ark.intel.com/products/83504/Intel-Core-i7-4870HQ-Processor-6M-Cache-up-to-3-70-GHz-?q=i7-4870HQ因此，请访问 ark.intel.com 并查找 i7-4870HQ： https ://ark.intel.com/products/83504/Intel-Core-i7-4870HQ-Processor-6M-Cache-up-to-3-70 -GHz-?q=i7-4870HQ

That core could run up to 3.7Ghz and real maximal rate is 29.6 GB/s for his hardware.该内核可以运行高达 3.7Ghz，其硬件的实际最大速率为 29.6 GB/s。 So where is another 4GB/s?那么另一个 4GB/s 在哪里呢？ Perhaps, it's spent on loop logic and other surrounding code within each iteration.也许，它在每次迭代中都花在了循环逻辑和其他周围代码上。

Now where is this false dependency?现在这个错误的依赖在哪里？ hardware runs at almost peak rate.硬件几乎以峰值速率运行。 Maybe my math is bad, it happens sometimes :)也许我的数学不好，有时会发生:)

PPPPPS Still people suggesting HW errata is culprit, so I follow suggestion and created inline asm example, see below. PPPPPS 仍然有人建议硬件勘误表是罪魁祸首，所以我按照建议创建了内联 asm 示例，见下文。

On my 7960X, first version (with single output to cnt0) runs at 11MB/s, second version (with output to cnt0, cnt1, cnt2 and cnt3) runs at 33MB/s.在我的 7960X 上，第一个版本（单输出到 cnt0）运行速度为 11MB/s，第二个版本（输出到 cnt0、cnt1、cnt2 和 cnt3）运行速度为 33MB/s。 And one could say - voila!可以说——瞧！ it's output dependency.这是输出依赖。

OK, maybe, the point I made is that it does not make sense to write code like this and it's not output dependency problem but dumb code generation.好吧，也许，我提出的观点是编写这样的代码没有意义，这不是输出依赖问题，而是愚蠢的代码生成。 We are not testing hardware, we are writing code to unleash maximal performance.我们不是在测试硬件，而是在编写代码以发挥最大性能。 You could expect that HW OOO should rename and hide those "output-dependencies" but, gash, just do the right things right and you will never face any mystery.您可以期望 HW OOO 应该重命名并隐藏那些“输出依赖项”，但是，请正确地做正确的事情，您将永远不会面临任何谜团。

uint64_t builtin_popcnt1a(const uint64_t* buf, size_t len) 
{
    uint64_t cnt0, cnt1, cnt2, cnt3;
    cnt0 = cnt1 = cnt2 = cnt3 = 0;
    uint64_t val = buf[0];
    #if 0
        __asm__ __volatile__ (
            "1:\n\t"
            "popcnt %2, %1\n\t"
            "popcnt %2, %1\n\t"
            "popcnt %2, %1\n\t"
            "popcnt %2, %1\n\t"
            "subq $4, %0\n\t"
            "jnz 1b\n\t"
        : "+q" (len), "=q" (cnt0)
        : "q" (val)
        :
        );
    #else
        __asm__ __volatile__ (
            "1:\n\t"
            "popcnt %5, %1\n\t"
            "popcnt %5, %2\n\t"
            "popcnt %5, %3\n\t"
            "popcnt %5, %4\n\t"
            "subq $4, %0\n\t"
            "jnz 1b\n\t"
        : "+q" (len), "=q" (cnt0), "=q" (cnt1), "=q" (cnt2), "=q" (cnt3)
        : "q" (val)
        :
        );
    #endif
    return cnt0;
}

Answer 11

Ok, I want to provide a small answer to one of the sub-questions that the OP asked that don't seem to be addressed in the existing questions.好的，我想为 OP 提出的一个子问题提供一个小答案，这些子问题似乎在现有问题中没有得到解决。 Caveat, I have not done any testing or code generation, or disassembly, just wanted to share a thought for others to possibly expound upon.警告，我没有做过任何测试或代码生成或反汇编，只是想分享一个想法供其他人可能阐述。

Why does the `static` change the performance?为什么`static`改变性能？

The line in question: uint64_t size = atol(argv[1])<<20;有问题的行： uint64_t size = atol(argv[1])<<20;

Short Answer简答

I would look at the assembly generated for accessing size and see if there are extra steps of pointer indirection involved for the non-static version.我会查看为访问size而生成的程序集，并查看非静态版本是否涉及额外的指针间接步骤。

Long Answer长答案

Since there is only one copy of the variable whether it was declared static or not, and the size doesn't change, I theorize that the difference is the location of the memory used to back the variable along with where it is used in the code further down.由于无论是否声明为static变量，变量都只有一份副本，并且大小不会改变，因此我推断差异在于用于支持变量的内存位置以及它在代码中的使用位置再向下。

Ok, to start with the obvious, remember that all local variables (along with parameters) of a function are provided space on the stack for use as storage.好的，从显而易见的开始，请记住，函数的所有局部变量（以及参数）都在堆栈上提供了用作存储的空间。 Now, obviously, the stack frame for main() never cleans up and is only generated once.现在，很明显，main() 的堆栈帧永远不会清理并且只生成一次。 Ok, what about making it static ?好的，让它成为static怎么样？ Well, in that case the compiler knows to reserve space in the global data space of the process so the location can not be cleared by the removal of a stack frame.那么，在这种情况下，编译器知道要在进程的全局数据空间中保留空间，因此无法通过移除堆栈帧来清除该位置。 But still, we only have one location so what is the difference?但是，我们只有一个位置，那么有什么区别呢？ I suspect it has to do with how memory locations on the stack are referenced.我怀疑这与如何引用堆栈上的内存位置有关。

When the compiler is generating the symbol table, it just makes an entry for a label along with relevant attributes, like size, etc. It knows that it must reserve the appropriate space in memory but doesn't actually pick that location until somewhat later in process after doing liveness analysis and possibly register allocation.当编译器生成符号表时，它只是为标签以及相关属性（如大小等）创建一个条目。它知道它必须在内存中保留适当的空间，但直到稍后才真正选择该位置在进行活性分析和可能的寄存器分配之后的过程。 How then does the linker know what address to provide to the machine code for the final assembly code?那么链接器如何知道为最终汇编代码提供给机器代码的地址呢？ It either knows the final location or knows how to arrive at the location.它要么知道最终位置，要么知道如何到达该位置。 With a stack, it is pretty simple to refer to a location based one two elements, the pointer to the stackframe and then an offset into the frame.使用堆栈，引用基于两个元素的位置非常简单，即指向堆栈帧的指针，然后是帧中的偏移量。 This is basically because the linker can't know the location of the stackframe before runtime.这基本上是因为链接器在运行之前无法知道堆栈帧的位置。

用 64 位替换 32 位循环计数器会在 Intel CPU 上使用 _mm_popcnt_u64 引入疯狂的性能偏差

问题描述

The Benchmark基准

The (absolutely crazy) results （绝对疯狂的）结果

More madness, please!更疯狂，拜托！

My question我的问题

The Disassembly拆卸

Lessons learned得到教训

11 个解决方案

解决方案1
1642 已采纳 2014-08-01 22:41:20

解决方案2
55 2014-08-01 22:55:56

解决方案3
29 2014-08-01 11:04:24

clang with `uint64_t size=atol(argv[1])<<20;` `uint64_t size=atol(argv[1])<<20;`

clang with `uint64_t size=1<<20;` `uint64_t size=1<<20;`

解决方案4
11 2014-08-01 20:12:53

解决方案5
11 2014-08-02 03:48:08

解决方案6
10 2014-08-04 15:37:48

解决方案7
9 2014-08-01 18:33:23

解决方案8
8 2016-05-04 11:14:57

解决方案9
5 2021-01-30 02:44:14

解决方案10
-2 2018-10-23 20:45:00

解决方案11
-3 2019-01-29 20:32:41

Why does the `static` change the performance?为什么`static`改变性能？

Short Answer简答

Long Answer长答案

用 64 位替换 32 位循环计数器会在 Intel CPU 上使用 _mm_popcnt_u64 引入疯狂的性能偏差

问题描述

The Benchmark基准

The (absolutely crazy) results （绝对疯狂的）结果

More madness, please!更疯狂，拜托！

My question我的问题

The Disassembly拆卸

Lessons learned得到教训

11 个解决方案

解决方案1 1642 已采纳 2014-08-01 22:41:20

解决方案2 55 2014-08-01 22:55:56

解决方案3 29 2014-08-01 11:04:24

clang with uint64_t size=atol(argv[1])<<20; uint64_t size=atol(argv[1])<<20;

clang with uint64_t size=1<<20; uint64_t size=1<<20;

解决方案4 11 2014-08-01 20:12:53

解决方案5 11 2014-08-02 03:48:08

解决方案6 10 2014-08-04 15:37:48

解决方案7 9 2014-08-01 18:33:23

解决方案8 8 2016-05-04 11:14:57

解决方案9 5 2021-01-30 02:44:14

解决方案10 -2 2018-10-23 20:45:00

解决方案11 -3 2019-01-29 20:32:41

Why does the static change the performance?为什么static改变性能？

Short Answer简答

Long Answer长答案

解决方案1
1642 已采纳 2014-08-01 22:41:20

解决方案2
55 2014-08-01 22:55:56

解决方案3
29 2014-08-01 11:04:24

clang with `uint64_t size=atol(argv[1])<<20;` `uint64_t size=atol(argv[1])<<20;`

clang with `uint64_t size=1<<20;` `uint64_t size=1<<20;`

解决方案4
11 2014-08-01 20:12:53

解决方案5
11 2014-08-02 03:48:08

解决方案6
10 2014-08-04 15:37:48

解决方案7
9 2014-08-01 18:33:23

解决方案8
8 2016-05-04 11:14:57

解决方案9
5 2021-01-30 02:44:14

解决方案10
-2 2018-10-23 20:45:00

解决方案11
-3 2019-01-29 20:32:41

Why does the `static` change the performance?为什么`static`改变性能？