
Mysteries of C++ optimization

Take the following two snippets:

#include <iostream>
using namespace std;

// utime() is the asker's timing helper (definition not shown); it is
// assumed to return a wall-clock timestamp as an unsigned long int.

int main()
{
    unsigned long int start = utime();

    __int128_t n = 128;

    for(__int128_t i=1; i<1000000000; i++)
        n = (n * i);

    unsigned long int end = utime();

    cout<<(unsigned long int) n<<endl;

    cout<<end - start<<endl;
}

and

int main()
{
    unsigned long int start = utime();

    __int128_t n = 128;

    for(__int128_t i=1; i<1000000000; i++)
        n = (n * i) >> 2;

    unsigned long int end = utime();

    cout<<(unsigned long int) n<<endl;

    cout<<end - start<<endl;
}

I am benchmarking 128-bit integers in C++. When executing the first one (just the multiplication), everything runs in approximately 0.95 seconds. When I also add the bit-shift operation (second snippet), the execution time rises to an astounding 2.49 seconds.

How is this possible? I thought that bit shifting was one of the lightest operations for a processor. How come there is so much overhead from such a simple operation? I am compiling with the -O3 flag.

Any ideas?

This question has been bugging me for the past few days, so I decided to do some more investigation. My initial answer focused on the difference in data values between the two tests. My assertion was that the integer multiplication unit in the processor finishes an operation in fewer clock cycles if one of the operands is zero.

While there are instructions that are clearly documented to work that way (integer division, for example), there are very strong indications that integer multiplication is done in a constant number of cycles on modern processors, regardless of input. The note in Intel's documentation that initially made me think the cycle count for integer multiplication can depend on input data doesn't seem to apply to these instructions. Also, I ran some more rigorous performance tests with the same sequence of instructions on both zero and non-zero operands, and the results didn't show significant differences. As far as I can tell, harold's comment on this subject is correct. My mistake; sorry.

While contemplating deleting this answer altogether, so that it doesn't lead people astray in the future, I realized there were still quite a few things worth saying on this subject. I also think there's at least one other way in which data values can influence performance in such calculations (included in the last section). So, I decided to restructure and expand the rest of the information in my initial answer, started writing, and... didn't quite stop for a while. It's up to you to decide whether it was worth it.

The information is structured into the following sections:

  • What the code does
  • What the compiler does
  • What the processor does
  • What you can do about it
  • Unanswered questions

What the code does

It overflows, mostly.

In the first version, n starts overflowing on the 33rd iteration. In the second version, with the shift, n starts overflowing on the 52nd iteration.

In the version without the shift, starting with the 128th iteration, n is zero (it overflows "cleanly", leaving only zeros in the least significant 128 bits of the result).
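This is easy to check directly. A minimal sketch (using the same compiler-specific __int128_t as the question; the helper name is mine) that finds the first iteration at which n collapses to zero:

```cpp
// Returns the iteration i at which n = 128 * 1 * 2 * ... * i first
// becomes zero in 128-bit arithmetic: once the running product has
// accumulated 128 or more factors of two, every remaining bit of the
// truncated result is zero. Returns -1 if it never happens in range.
int first_zero_iteration() {
    __int128_t n = 128;
    for (int i = 1; i <= 200; ++i) {
        n *= i;
        if (n == 0) return i;
    }
    return -1;
}
```

Running it confirms the claim: the product is non-zero through i = 127 (it still has an odd factor times 2^127) and becomes exactly zero on the 128th iteration.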

In the second version, the right shift (dividing by 4) takes out more factors of two from the value of n on each iteration than the new operands bring in, so the shift results in rounding on some iterations. A quick calculation: the total number of factors of two in all numbers from 1 to 128 is equal to

128/2 + 128/4 + ... + 2 + 1 = 2^6 + 2^5 + ... + 2 + 1 = 2^7 - 1

while the number of factors of two taken out by the right shift (if it had enough to take from) is 128 * 2, more than double.
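The count of 2^7 - 1 = 127 is easy to verify with a hypothetical helper (not part of the original benchmark) that tallies the factors of two in each operand:

```cpp
// Counts the total number of factors of two contributed by the
// multiplicands 1, 2, ..., 128 (i.e. the 2-adic valuation of 128!).
int twos_in_factorial_128() {
    int count = 0;
    for (int i = 1; i <= 128; ++i)
        for (int v = i; v % 2 == 0; v /= 2)  // strip factors of two from i
            ++count;
    return count;
}
```

Compare that 127 with the 256 bits the shift tries to remove over 128 iterations, and the rounding is inevitable.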

Armed with this knowledge, we can give a first answer: from the point of view of the C++ standard, this code spends most of its time in undefined-behaviour land, so all bets are off. Problem solved; stop reading now.

What the compiler does

If you're still reading, from this point forward we'll ignore the overflows and look at some implementation details. "The compiler", in this case, means GCC 4.9.2 or Clang 3.5.1. I've only done performance measurements on code generated by GCC. For Clang, I've looked at the generated code for a few test cases and noted some differences that I'll mention below, but I haven't actually run the code; I might have missed some things.

Both multiplication and shift operations are available for 64-bit operands, so 128-bit operations need to be implemented in terms of those. First, multiplication: n can be written as 2^64 * nh + nl, where nh and nl are the high and low 64-bit halves, respectively. The same goes for i. So, the multiplication can be written:

(2^64 * nh + nl)(2^64 * ih + il) = 2^128 * nh * ih + 2^64 * (nh * il + nl * ih) + nl * il

The first term doesn't have any non-zero bits in the lower 128-bit part; it's either all overflow or all zero. Since ignoring integer overflow is valid and common in C++ implementations, that's what the compiler does: the first term is ignored completely.

The parenthesized term only contributes bits to the upper 64-bit half of the 128-bit result; any overflow resulting from the two multiplications or the addition is also ignored (the result is truncated to 64 bits).

The last term determines the bits in the low 64-bit half of the result and, if the result of that multiplication has more than 64 bits, the extra bits need to be added to the high 64-bit half obtained from the parenthesized term discussed before. There's a very useful multiplication instruction in x86-64 assembly that does just what's needed: it takes two 64-bit operands and places the result in two 64-bit registers, so the high half is ready to be added to the result of the operations in the parenthesis.

That is how 128-bit integer multiplication is implemented: three 64-bit multiplications and two 64-bit additions.
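As a sketch of that lowering (not the compiler's literal output), it can be written out in C++ on 64-bit halves; __uint128_t is used only to capture the one full-width product, playing the role of the widening MUL instruction:

```cpp
#include <cstdint>

// Sketch of how a 128-bit multiply is lowered to 64-bit operations:
// three multiplications and two additions, with overflow out of the
// top 64 bits silently discarded.
__uint128_t mul128(__uint128_t a, __uint128_t b) {
    std::uint64_t ah = (std::uint64_t)(a >> 64), al = (std::uint64_t)a;
    std::uint64_t bh = (std::uint64_t)(b >> 64), bl = (std::uint64_t)b;
    __uint128_t low = (__uint128_t)al * bl;   // widening MUL: full 128-bit product
    std::uint64_t high = ah * bl + al * bh    // truncated 64-bit products
                       + (std::uint64_t)(low >> 64); // carry the high half of al*bl
    return ((__uint128_t)high << 64) | (std::uint64_t)low;
}
```

The 2^128 * ah * bh term never appears: it has no bits in the truncated result, so it is simply dropped, exactly as described above.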

Now, the shift: using the same notation as above, the two least significant bits of nh need to become the two most significant bits of nl, after the contents of the latter are shifted right by two bits. Using C++ syntax, it would look like this:

nl = nh << 62 | nl >> 2 //Doesn't change nh, only uses its bits.

Besides that, nh also needs to be shifted, using something like

nh >>= 2;

That is how the compiler implements a 128-bit shift. For the first part, there's an x86-64 instruction that has the exact semantics of that expression; it's called SHRD. Using it can be good or bad, as we'll see below, and the two compilers make slightly different choices in this respect.

What the processor does

... is highly processor-dependent. (No... really?!)

Detailed information about what happens on Haswell processors can be found in harold's excellent answer. Here, I'll try to cover more ground at a higher level. For more detailed timing data, the usual sources are Agner Fog's instruction tables and the vendors' optimization manuals.

I'll refer to the following architectures: IntelSB (the Intel Sandy Bridge family), IntelH (the Intel Haswell family), and AMD.

I have measurement data taken on an IntelSB system; I think it's precise enough, as long as the compiler doesn't act up. Unfortunately, when working with such tight loops, this can happen very easily. At various points during testing, I had to use all kinds of stupid tricks to avoid GCC's idiosyncrasies, usually related to register use. For example, it seems to have a tendency to shuffle registers around unnecessarily when compiling simpler code, compared with other cases where it generates optimal assembly. Ironically, on my test setup, it tended to generate optimal code for the second sample, using the shift, and worse code for the first one, making the impact of the shift less visible. Clang/LLVM seems to have fewer of those bad habits, but then again, I looked at fewer examples using it and I didn't measure any of them, so this doesn't mean much. In the interest of comparing apples with apples, all measurement data below refers to the best code generated for each case.

First, let's rearrange the expression for 128-bit multiplication from the previous section into a (horrible) diagram:

nh * il
        \
         +  -> tmp
        /          \
nl * ih             + -> next nh
                   /
             high 64 bits
                 /
nl * il --------
         \
      low 64 bits 
           \
             -> next nl

(sorry, I hope it gets the point across)

Some important points:

  • The two additions can't execute until their respective inputs are ready; the final addition can't execute until everything else is ready.
  • The three multiplications can, theoretically, execute in parallel (no input depends on another multiplication's output).
  • In the ideal scenario above, the total number of cycles to complete the entire calculation for one iteration is the sum of the number of cycles for one multiplication and two additions.
  • The next nl can be ready early. This, together with the fact that the next il and ih are very cheap to calculate, means the nl * il and nl * ih calculations for the next iteration can start early, possibly before the next nh has been computed.

Multiplications can't really execute entirely in parallel on these processors, as there's only one integer multiplication unit per core, but they can execute concurrently through pipelining. One multiplication can begin executing on each cycle on Intel, and every 4 cycles on AMD, even if previous multiplications haven't finished executing yet.

All of the above means that, if the loop body doesn't contain anything else that gets in the way, the processor can reorder those multiplications to achieve something as close as possible to the ideal scenario above. This applies to the first code snippet. On IntelH, as measured by harold, it's exactly the ideal scenario: 5 cycles per iteration, made up of 3 cycles for one multiplication and one cycle each for the two additions (impressive, to be honest). On IntelSB, I measured 6 cycles per iteration (closer to 5.5, actually).

The problem is that in the second code snippet something does get in the way:

nh * il
        \                              normal shift -> next nh
         +  -> tmp                   /
        /          \                /
nl * ih             + ----> temp nh
                   /                \
             high 64 bits            \
                 /                     "composite" shift -> next nl
nl * il --------                     /
         \                          /
      low 64 bits                  /
           \                      /
             -> temp nl ---------

The next nl is no longer ready early. temp nl has to wait for temp nh to be ready, so that both can be fed into the composite shift, and only then will we have the next nl. Even if both shifts are very fast and execute in parallel, they don't just add the execution cost of one shift to an iteration; they also change the dynamics of the loop's "pipeline", acting like a sort of synchronizing barrier.

If the two shifts finish at the same time, then all three multiplications for the next iteration will be ready to execute at the same time, and they can't all start in parallel, as explained above; they'll have to wait for one another, wasting cycles. This is the case on IntelSB, where the two shifts are equally fast (see below); I measured 8 cycles per iteration for this case.

If the two shifts don't finish at the same time, it will typically be the normal shift that finishes first (the composite shift is slower on most architectures). This means that the next nh will be ready early, so the top multiplication can start early for the next iteration. However, the other two multiplications still have to wait more (wasted) cycles for the composite shift to finish; then they'll be ready at the same time, and one will have to wait for the other to start, wasting some more time. This is the case on IntelH, measured by harold at 9 cycles per iteration.

I expect AMD to fall under this last category as well. While there's an even bigger difference in performance between the composite shift and the normal shift on that platform, integer multiplications are also slower on AMD than on Intel (more than twice as slow), making the first sample slower to begin with. As a very rough estimate, I think the first version could take about 12 cycles on AMD, with the second one at around 16. It would be nice to have some concrete measurements, though.

Some more data on the troublesome composite shift, SHRD:

  • On IntelSB, it's exactly as cheap as a simple shift (great!); simple shifts are about as cheap as they come: they execute in one cycle, and two shifts can start executing each cycle.
  • On IntelH, SHRD takes 3 cycles to execute (yes, it got worse in the newer generation), and two shifts of any kind (simple or composite) can start executing each cycle.
  • On AMD, it's even worse. If I'm reading the data correctly, executing an SHRD keeps both shift execution units busy until execution finishes - no parallelism and no pipelining possible; it takes 3 cycles, during which no other shift can start executing.

What you can do about it

I can think of three possible improvements:

  1. replace SHRD with something faster on platforms where it makes sense;
  2. optimize the multiplication to take advantage of the data types involved here;
  3. restructure the loop.

1. SHRD can be replaced with two shifts and a bitwise OR, as described in the compiler section. A C++ implementation of a 128-bit right shift by two bits could look like this:

#include <cstdint>

__int128_t shr2(__int128_t n)
{
   using std::int64_t;
   using std::uint64_t;

   //Unpack the two halves.
   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   //Do the actual work.
   uint64_t rl = nl >> 2 | nh << 62;
   int64_t rh = nh >> 2;

   //Pack the result.
   return static_cast<__int128_t>(rh) << 64 | rl;
}

Although it looks like a lot of code, only the middle section doing the actual work generates shifts and ORs. The other parts merely indicate to the compiler which 64-bit parts we want to work with; since the 64-bit parts are already in separate registers, they are effectively no-ops in the generated assembly code.

However, keep in mind that this amounts to "trying to write assembly using C++ syntax", and it's generally not a very bright idea. I'm only using it because I verified that it works for GCC, and I'm trying to keep the amount of assembly code in this answer to a minimum. Even so, there's one surprise: the LLVM optimizer detects what we're trying to do with those two shifts and one OR and... replaces them with an SHRD in some cases (more about this below).

Functions of the same form can be used for shifts by other numbers of bits less than 64. From 64 to 127 it gets easier, but the form changes. One thing to keep in mind is that it would be a mistake to pass the number of bits to shift as a runtime parameter to a shr function. Shift instructions with a variable count are slower than the ones using a constant count on most architectures. You could use a non-type template parameter to generate different functions at compile time - this is C++, after all...
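A minimal sketch of that idea, assuming a shift count K with 0 < K < 64 (the function name mirrors shr2 above but is otherwise hypothetical):

```cpp
#include <cstdint>

// Compile-time shift count: each instantiation is compiled with
// constant-shift instructions, avoiding the slower variable-count forms.
template <unsigned K>
__int128_t shr(__int128_t n) {
    static_assert(K > 0 && K < 64, "this form only handles counts 1..63");

    //Unpack the two halves.
    std::int64_t nh = static_cast<std::int64_t>(n >> 64);
    std::uint64_t nl = static_cast<std::uint64_t>(n);

    //Do the actual work: the low K bits of nh move into the top of nl.
    std::uint64_t rl = nl >> K | static_cast<std::uint64_t>(nh) << (64 - K);
    std::int64_t rh = nh >> K;

    //Pack the result.
    return static_cast<__int128_t>(rh) << 64 | rl;
}
```

Each instantiation (shr<2>, shr<5>, ...) is a separate function with its shift counts baked in as immediates.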

I think using such a function makes sense on all architectures except IntelSB, where SHRD is already as fast as it can get. On AMD, it will definitely be an improvement. Less so on IntelH: for our case, I don't think it will make a difference, but generally it could shave one cycle off some calculations; there could theoretically be cases where it makes things slightly worse, but I think those are very uncommon (as usual, there's no substitute for measuring). I don't think it will make a difference for our loop, because it will change things from [nh being ready after one cycle and nl after three] to [both being ready after two]; this means all three multiplications for the next iteration will be ready at the same time and they'll have to wait for one another, essentially wasting the cycle that was gained by the shift.

GCC seems to use SHRD on all architectures, and the "assembly in C++" code above can be used as an optimization where it makes sense. The LLVM optimizer uses a more nuanced approach: it does the optimization (replaces SHRD) automatically for AMD, but not for Intel, where it even reverses it, as mentioned above. This may change in future releases, as indicated by the discussion on the LLVM patch that implemented this optimization. For now, if you want to use the alternative with LLVM on Intel, you'll have to resort to assembly code.

2. Optimizing the multiplication: the test code uses a 128-bit integer for i, but that's not needed in this case, as its value fits easily in 64 bits (32, actually, but that doesn't help us here). This means that ih will always be zero, which reduces the diagram for 128-bit multiplication to the following:

nh * il
        \
         \
          \
           + -> next nh
          /
    high 64 bits
        /
nl * il 
        \
     low 64 bits 
          \
            -> next nl

Normally, I'd just say "declare i as long long and let the compiler optimize things", but unfortunately this doesn't work here; both compilers go for the standard behaviour of converting the two operands to their common type before doing the calculation, so i ends up on 128 bits even if it starts on 64. We'll have to do things the hard way:

#include <cstdint>

__int128_t mul(__int128_t n, long long i)
{
   using std::int64_t;
   using std::uint64_t;

   //Unpack the two halves.
   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   //Do the actual work.
   __asm__(R"(
    movq    %0, %%r10
    imulq   %2, %%r10
    mulq    %2
    addq    %%r10, %0
   )" : "+d"(nh), "+a"(nl) : "r"(i) : "%r10");

   //Pack the result.
   return static_cast<__int128_t>(nh) << 64 | nl;
}

I said I'd try to avoid assembly code in this answer, but it's not always possible. I managed to coax GCC into generating the right code with "assembly in C++" for the function above, but once the function is inlined, everything falls apart - the optimizer sees what's going on in the complete loop body and converts everything to 128 bits. LLVM seems to behave in this case but, since I was testing on GCC, I had to use a reliable way to get the right code in there.

Declaring i as long long and using this function instead of the normal multiplication operator, I measured 5 cycles per iteration for the first sample and 7 cycles for the second one on IntelSB, a gain of one cycle in each case. I expect it to shave one cycle off the iterations for both examples on IntelH as well.

3. The loop can sometimes be restructured to encourage pipelined execution, when (at least some) iterations don't really depend on previous results, even though it may look like they do. For example, we could replace the for loop of the second sample with something like this:

__int128_t n2 = 1;
long long j = 1000000000 / 2;
for(long long i = 1; i < 1000000000 / 2; ++i, ++j)
{
   n *= i;
   n2 *= j;
   n >>= 2;
   n2 >>= 2; 
}
n *= (n2 * j) >> 2;

This takes advantage of the fact that some partial results can be calculated independently and only aggregated at the end. We're also hinting to the compiler that we want to pipeline the multiplications and shifts (not always necessary, but it does make a small difference for GCC for this code).

The code above is nothing more than a naive proof of concept. Real code would need to handle the total number of iterations in a more reliable way. The bigger problem is that this code won't generate the same results as the original, because of different behaviour in the presence of overflow and rounding. Even if we stop the loop on the 51st iteration to avoid overflow, the result will still differ by about 10%, because the right shifts round in different places. In real code, this would most likely be a problem; but then again, you wouldn't have real code like this, would you?

Assuming this technique is applied to a case where the problems above don't occur, I measured the performance of such code in a few cases, again on IntelSB. The results are given in "cycles per iteration", as before, where "iteration" means one from the original code (I divided the total number of cycles for executing the whole loop by the total number of iterations executed by the original code, not the restructured one, to have a meaningful comparison):

  • The code above executes in 7 cycles per iteration, a gain of one cycle over the original.
  • The code above with the multiplication operator replaced with our mul() function needs 6 cycles per iteration.

The restructured code does suffer from more register shuffling, which unfortunately can't be avoided (more variables). More recent processors like IntelH have architectural improvements that make register moves essentially free in many cases; this could make the code yield even larger gains. Using newer instructions like MULX on IntelH could avoid some register moves altogether; GCC does use such instructions when compiling with -march=haswell.

Unanswered questions

None of the measurements that we have so far explain the large differences in performance reported by the OP, and observed by me on a different system.

My initial timings were taken on a remote system (with a Westmere family processor) where, of course, a lot of things could happen; yet, the results were strangely stable.

On that system, I also experimented with executing the second sample with a right shift and a left shift; the code using a right shift was consistently 50% slower than the other variant. I couldn't replicate that on my controlled test system on IntelSB, and I don't have an explanation for those results either.

We can discard all of the above as unpredictable side effects of compiler / processor / overall system behaviour, but I can't shake the feeling that not everything has been explained here.

It's true that it doesn't really make much sense to benchmark such tight loops without a controlled system, precise tools (counting cycles), and a look at the generated assembly code for each case. Compiler idiosyncrasies can easily result in code that artificially introduces differences of 50% or more in performance.

Another factor that could explain large differences is the presence of Intel Hyper-Threading. Different parts of the core behave differently when this is enabled, and the behaviour has also changed between processor families. This could have strange effects on tight loops.

To top everything off, here's a crazy hypothesis: flipping bits consumes more power than keeping them constant. In our case, the first sample, working with zero values most of the time, will be flipping far fewer bits than the second one, so the latter will consume more power. Many modern processors have features that dynamically adjust the core frequency depending on electrical and thermal limits (Intel Turbo Boost / AMD Turbo Core). This means that, theoretically, under the right (or wrong?) conditions, the second sample could trigger a reduction of the core frequency, thus making the same number of cycles take longer, and making the performance data-dependent.

After benchmarking both (using the assembly generated by GCC 4.7.3 with -O2) on my 4770K, I found that the first one takes 5 cycles per iteration and the second one takes 9 cycles per iteration. Why so much difference?

It turns out to be an interplay between throughput and latency. The main killer is shrd, which takes 3 cycles and is on the critical path. Here's a picture of it (I ignore the chain for i because it is faster and there is plenty of spare throughput for it to just run ahead; it will not interfere):

[image: dependency chains]

The edges here are dependencies, not dataflow.

Based solely on the latencies in this chain, the expected time would be 8 cycles per iteration. But it is not. The problem here is that for 8 cycles to happen, mul2 and imul3 have to execute in parallel, and integer multiplication only has a throughput of 1/cycle. So one of them has to wait a cycle, and holds up the chain by a cycle. I verified this by changing that imul to an add, which reduced the time to 8 cycles per iteration. Changing the other imul to an add had no effect, as predicted by this explanation (it doesn't depend on shrd and can thus be scheduled earlier, without interfering with the other multiplications).

These exact details are only for Haswell.

The code I used was this:

section .text

global cmp1
proc_frame cmp1
[endprolog]
    mov r8, rsi
    mov r9, rdi
    mov esi, 1
    xor edi, edi
    mov eax, 128
    xor edx, edx
.L2:
    mov rcx, rdx
    mov rdx, rdi
    imul    rdx, rax
    imul    rcx, rsi
    add rcx, rdx
    mul rsi
    add rdx, rcx
    add rsi, 1
    mov rcx, rsi
    adc rdi, 0
    xor rcx, 10000000
    or  rcx, rdi
    jne .L2
    mov rdi, r9
    mov rsi, r8
    ret
endproc_frame

global cmp2
proc_frame cmp2
[endprolog]
    mov r8, rsi
    mov r9, rdi
    mov esi, 1
    xor edi, edi
    mov eax, 128
    xor edx, edx
.L3:
    mov rcx, rdi
    imul    rcx, rax
    imul    rdx, rsi
    add rcx, rdx
    mul rsi
    add rdx, rcx
    shrd    rax, rdx, 2
    sar rdx, 2
    add rsi, 1
    mov rcx, rsi
    adc rdi, 0
    xor rcx, 10000000
    or  rcx, rdi
    jne .L3
    mov rdi, r9
    mov rsi, r8
    ret
endproc_frame

Unless your processor can support native 128-bit operations, the operations will have to be software-coded to use the next best option.

Your 128-bit operations use the same scheme that 8-bit processors used for 16-bit operations, and this takes time.

For example, a 128-bit right shift by one bit, using 64-bit registers, requires:

  • Shift the most significant register right; the carry flag will contain the bit that was shifted out.
  • Shift the least significant register right, with carry; the carry flag is shifted into the most significant bit position.
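That two-step sequence can be sketched in C++ (a hypothetical helper mirroring the SHR-then-RCR idea described above, not actual generated code):

```cpp
#include <cstdint>

// One-bit right shift of a 128-bit value held in two 64-bit words,
// mirroring the SHR (high word) + RCR (low word) sequence: the bit
// shifted out of the high word is carried into the low word's MSB.
void shr1(std::uint64_t& hi, std::uint64_t& lo) {
    std::uint64_t carry = hi & 1;    // bit shifted out of the high word
    hi >>= 1;
    lo = (lo >> 1) | (carry << 63);  // carry enters the low word's MSB
}
```

Every 128-bit operation decomposes into at least two such 64-bit steps, which is where the extra time goes.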

Without support for native 128-bit operations, your code will take twice as many operations as the equivalent 64-bit operations, and sometimes more (multiplication, for example). This is why you are seeing such poor performance.

I highly recommend only using 128 bits in places where it is extremely necessary.
