AND operator + addition faster than subtraction

Question

I've measured the execution time of the following code snippets:

volatile int r = 768;
r -= 511;

volatile int r = 768;
r = (r & ~512) + 1;

Assembly:

mov     eax, DWORD PTR [rbp-4]
sub     eax, 511
mov     DWORD PTR [rbp-4], eax

mov     eax, DWORD PTR [rbp-4]
and     ah, 253
add     eax, 1
mov     DWORD PTR [rbp-4], eax
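
A note on why the two snippets agree for this input: r - 511 equals (r - 512) + 1, and r & ~512 equals r - 512 only when bit 9 (value 512) is set in r, as it is for 768. The and ah, 253 in the second listing is the same mask applied to the second-lowest byte of eax, since 253 is 0xFD, which clears bit 1 of ah, i.e. bit 9 of eax. A minimal C sketch to check this (the function names are illustrative and do not come from the post):

```c
#include <stdio.h>

/* Hypothetical helper names, chosen for illustration only. */

/* First snippet: plain subtraction. */
int by_sub(int r) { return r - 511; }

/* Second snippet: clear bit 9 (value 512), then add 1. */
int by_and_add(int r) { return (r & ~512) + 1; }

/* Mimic `and ah, mask`: apply the mask to the second-lowest byte only. */
unsigned and_ah(unsigned eax, unsigned char mask) {
    unsigned ah = (eax >> 8) & 0xFFu;
    ah &= mask;
    return (eax & ~0xFF00u) | (ah << 8);
}

/* Print both results for one input so they can be compared. */
void compare(int r) {
    printf("r=%d  sub=%d  and+add=%d\n", r, by_sub(r), by_and_add(r));
}
```

Calling compare(768) shows 257 for both variants, while compare(256) shows -255 versus 257, so the AND + addition form is only a substitute for the subtraction when bit 9 is known to be set.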

The results:

Subtraction time: 141ns   
AND + addition: 53ns

I've run the snippets multiple times with consistent results.
Can someone explain to me why this is the case, even though the AND + addition version has one more line of assembly?

Answer 1

Your assertion that one snippet is faster than the other is mistaken.
If you look at the code:

mov     eax, DWORD PTR [rbp-4]
....
mov     DWORD PTR [rbp-4], eax

You'll see that the running time is dominated by the load/store to memory.
Even on Skylake this will take 2+2 = 4 cycles minimum.
The 1 cycle that the sub takes, or the 3 *) cycles that the and bytereg / add full-reg pair takes, simply disappears into the memory access time.
On older processors such as Core2 it takes 5 cycles minimum to do a load/store pair to the same address.
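
To put these cycle counts next to the measured times: at a clock of f GHz, n cycles take n / f nanoseconds. The sketch below is only arithmetic, and the 3 GHz used in the usage note is an assumed placeholder, since the question does not state the CPU frequency; the helper names are hypothetical.

```c
#include <stdio.h>

/* Convert a cycle count to nanoseconds for a clock given in GHz. */
double cycles_to_ns(double cycles, double ghz) { return cycles / ghz; }

/* Convert a measured time in nanoseconds back to cycles. */
double ns_to_cycles(double ns, double ghz) { return ns * ghz; }

/* Print the comparison for an assumed clock frequency (an assumption, not from the post). */
void report(double ghz) {
    printf("5 cycles = %.2f ns at %.1f GHz\n", cycles_to_ns(5.0, ghz), ghz);
    printf("141 ns   = %.0f cycles\n", ns_to_cycles(141.0, ghz));
    printf("53 ns    = %.0f cycles\n", ns_to_cycles(53.0, ghz));
}
```

Under the assumed 3 GHz, report(3.0) gives about 1.67 ns for 5 cycles, while 141 ns and 53 ns correspond to 423 and 159 cycles. That is far more than the handful of cycles discussed here, which suggests the measured times mostly reflect measurement overhead rather than the instructions themselves.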

It is difficult to time such short sequences of code, and care should be taken to ensure you have the correct methodology.
You also need to remember that rdtsc is not accurate on Intel processors and, to boot, executes out of order.

If you use proper timing code like:

.... x 100,000    //stress the cpu using integer code in a 100,000 x loop to ensure it's running at 100%
cpuid             //serializing instruction, to make sure rdtscp does not run early.
rdtscp            //use the serializing version to ensure it does not run late
push eax
push edx
mov reg1,1000*1000   //time a minimum of 1,000,000 runs to ensure accuracy
loop:
...                  //insert code to time here
sub reg1,1           //don't use dec, it causes a partial-flags stall.
jnz loop             //loop
//kernel mode only!
//mov eax,cr0          //reading and writing to cr0 serializes as well.
//mov cr0,eax
cpuid                //serialization in user mode.
rdtscp               //make sure to use the 'p' version of rdtsc.
push eax
push edx
pop 4x               //retrieve the start and end times from the stack.

Run the timing code 100 times and take the lowest cycle count.
Now you'll have an accurate count to within 1 or 2 cycles.
You'll want to time an empty loop as well and subtract its time, so that you can see the net time spent executing the instructions of interest.
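
The same methodology (many iterations per run, many runs, keep the minimum, subtract an empty baseline) can be sketched in portable C. This is a coarser stand-in for the cpuid/rdtscp sequence, not an equivalent: it uses the standard clock() function, which reports CPU time rather than cycles, and the kernel_* and helper names are hypothetical.

```c
#include <time.h>

/* Shared variable, volatile as in the question so accesses are not optimized away. */
volatile int r = 768;

/* Hypothetical kernels for illustration; each resets r so every call does the same work. */
void kernel_empty(void)   { r = 768; }
void kernel_sub(void)     { r = 768; r -= 511; }
void kernel_and_add(void) { r = 768; r = (r & ~512) + 1; }

/* Time `iterations` calls of fn, repeat `repeats` times, and return the
   fastest run in seconds (or -1.0 if repeats is not positive). */
double best_seconds(void (*fn)(void), long iterations, int repeats) {
    double best = -1.0;
    for (int rep = 0; rep < repeats; ++rep) {
        clock_t start = clock();
        for (long i = 0; i < iterations; ++i) {
            fn();
        }
        clock_t stop = clock();
        double elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}

/* Net nanoseconds per call: fastest run of fn minus fastest run of the empty kernel. */
double net_ns_per_call(void (*fn)(void), long iterations, int repeats) {
    double base = best_seconds(kernel_empty, iterations, repeats);
    double full = best_seconds(fn, iterations, repeats);
    return (full - base) * 1e9 / (double)iterations;
}
```

For example, net_ns_per_call(kernel_sub, 1000000, 100) can be compared with net_ns_per_call(kernel_and_add, 1000000, 100). The call through a function pointer adds overhead to every iteration, which the empty-kernel baseline is meant to cancel, and with noise the net figure can come out slightly negative.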

If you do this you'll discover that add and sub run at exactly the same speed, just as they have on every x86/x64 CPU since the 8086.
This, of course, is also what Agner Fog, the Intel CPU manuals, the AMD CPU manuals, and just about any other available source say.

*) and ah,value takes 1 cycle, then the CPU stalls for 1 cycle due to the partial register write, and the add eax,value takes another cycle.

Optimized code

sub     DWORD PTR [rbp-4],511

This might be faster if you don't need to reuse the value elsewhere. The latency is high at 5 cycles, but the reciprocal throughput is 1 cycle, which is much better than either of your versions.

Answer 2

The full machine code is

8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
2d ff 01 00 00          sub    eax,0x1ff
89 45 fc                mov    DWORD PTR [rbp-0x4],eax

vs

8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
80 e4 fd                and    ah,0xfd
83 c0 01                add    eax,0x1
89 45 fc                mov    DWORD PTR [rbp-0x4],eax

This means the code for the second operation is in fact only one byte longer (11 vs 12 bytes). Most likely the CPU fetches code in units larger than single bytes, so fetching isn't much slower. It can also decode multiple instructions at the same time, so the first sample doesn't have an advantage there either. Executing a single add, and, or sub each takes a single ALU pass, so each takes only one clock on a single execution unit. That's a 1 ns advantage for your sub on a 1 GHz CPU.

So basically both operations are more or less the same. The difference may be attributed to other factors. Maybe memory cell rbp-0x4 is still in L1 cache by the time you run the second code snippet. Or the instructions for the first snippet are located less favorably in memory. Or the CPU was able to run the second snippet speculatively before you started measuring, etc. One would need to know how you measured the speed to decide that.

2 answers

Answer 1: score 5, accepted, 2017-03-19 16:25:44
Answer 2: score -1, 2017-03-19 16:27:36
