
How can the multiplication be faster than shifting bits to the left?

It is well known that shifting bits to the left is faster than multiplying, because barrel shifters are implemented directly in hardware. Therefore, this simple benchmark should be wrong:

<?php
$start = 1;

// Time 10 million left shifts by 2.
$timestart = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    $result2 = $start << 2;
}
echo microtime(true) - $timestart;
echo "\n";

// Time 10 million multiplications by 4.
$timestart = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    $result1 = $start * 4;
}
echo microtime(true) - $timestart;
echo "\n";

I ran it several times, and multiplying was always faster than shifting bits to the left. For example:

0.73733711242676

0.71091389656067

So either the benchmark is wrong or the PHP interpreter is doing something here. The test was run with PHP 7.0.32 on Ubuntu:

PHP 7.0.32-0ubuntu0.16.04.1 (cli) ( NTS )

CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz

Edit:

Running it on a Windows box with almost the same CPU (Intel(R) Core(TM) i5-4460S CPU @ 2.90GHz), the results are as expected:

0.24960112571716

0.28080010414124

The PHP version in this case is different:

PHP 7.1.19 (cli) (built: Jun 20 2018 23:24:42) ( ZTS MSVC14 (Visual C++ 2015) x64 )

Your reasoning about hardware is basically irrelevant. You're using an interpreted language where most of the cost is interpreter overhead.

An asm version of either loop could run at 1 iteration per clock (assuming a fixed-count shift), so the 10 million iterations would take (on a 3GHz CPU) about 3.3 ms, or 0.0033 seconds, roughly 200 to 250 times faster than your PHP times.
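The estimate above is simple arithmetic, sketched here so the assumptions are explicit. The function name nativeLoopSeconds() is made up for this example; the 3 GHz clock and the 1 iteration per cycle are the assumptions from the paragraph above, the iteration count is taken from the question's loop, and the measured time is the shift result reported in the question.

```php
<?php
// Back-of-envelope estimate, not a measurement.
// Assumes a hypothetical native loop retiring $iterationsPerCycle iterations per clock.
function nativeLoopSeconds(int $iterations, float $clockHz, float $iterationsPerCycle = 1.0): float
{
    return $iterations / ($clockHz * $iterationsPerCycle);
}

$iterations = 10000000;            // loop count from the question
$clockHz    = 3.0e9;               // assumed 3 GHz clock
$measured   = 0.73733711242676;    // PHP shift-loop time reported in the question

$native = nativeLoopSeconds($iterations, $clockHz);

printf("native estimate: %.4f s\n", $native);
printf("PHP time / native estimate: about %.0fx\n", $measured / $native);
```

With these inputs the ratio comes out a little over 200, which is the sense in which the shift or multiply itself is a tiny fraction of each interpreted iteration.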


Also, an interpreted loop has to use a variable-count shift (because it can't JIT-compile the shift count into an immediate in the machine code), which is actually more expensive for throughput (3 uops) on Intel CPUs because of x86 legacy baggage (flag semantics). AMD CPUs have single-uop shifts even for variable shift counts (shl reg, cl vs. shl reg, imm8). See "INC instruction vs ADD 1: Does it matter?" for more about why shl reg, cl is 3 uops on Sandybridge-family, and how it could create a false dependency through flags.

Integer multiply is 1 uop, 1-per-clock throughput, 3 cycle latency, on Intel Sandybridge-family and AMD Ryzen. It is 1 per 2 clocks on AMD Bulldozer-family, where it is not fully pipelined. So yes, multiply has higher latency, but on Intel and Ryzen both operations are fully pipelined for throughput. Your loop throws away the result, so there's no loop-carried dependency chain and latency is irrelevant (it is hidden by out-of-order execution).

But that minor difference (2 extra uops) is not enough to account for the measured difference. The actual shift or multiply is only about 1/250th of the total cycles the loop takes. You say switching the order of the loops doesn't change the result, so it's not just a warm-up effect before your CPU ramps up to max clock speed.
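One way to check how much of the time is interpreter overhead is to time an empty loop next to the two operations, and to repeat each measurement a few times and keep the best run so warm-up and scheduling noise matter less. This is only a sketch: bestTime(), ITERATIONS and REPEATS are names invented for this example, and it assumes the PHP CLI runs without an optimizer (such as opcache) that might remove the unused assignments.

```php
<?php
// Sketch of a slightly more careful micro-benchmark.
// All names here (ITERATIONS, REPEATS, bestTime) are hypothetical helpers.

const ITERATIONS = 1000000;
const REPEATS = 5;

// Run $body $repeats times and return the best (smallest) wall-clock time in seconds.
function bestTime(callable $body, int $repeats = REPEATS): float
{
    $best = INF;
    for ($r = 0; $r < $repeats; $r++) {
        $t0 = microtime(true);
        $body();
        $best = min($best, microtime(true) - $t0);
    }
    return $best;
}

// Baseline: the loop counter and branch only.
$empty = function () {
    for ($i = 0; $i < ITERATIONS; $i++) {
    }
};

// Same loop with a left shift by 2.
$shift = function () {
    $start = 1;
    for ($i = 0; $i < ITERATIONS; $i++) {
        $result = $start << 2;
    }
};

// Same loop with a multiplication by 4.
$multiply = function () {
    $start = 1;
    for ($i = 0; $i < ITERATIONS; $i++) {
        $result = $start * 4;
    }
};

printf("empty loop: %.4f s\n", bestTime($empty));
printf("shift:      %.4f s\n", bestTime($shift));
printf("multiply:   %.4f s\n", bestTime($multiply));
```

If the empty loop accounts for most of the time of the other two, the gap between shift and multiply is being decided by how the interpreter dispatches and handles each operation, not by the shifter or multiplier in the CPU.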

You haven't mentioned what CPU microarchitecture you're running on, but the answer probably doesn't depend on how shift vs. multiply instructions decode.


 