Why is multiplication of C++ double-precision values as fast as addition?

Question

#include<vector>
#include<iostream>
#include<random>
#include<chrono>

int main()
{
    int i;
    std::mt19937 rng(std::chrono::system_clock::now().time_since_epoch().count());
    std::uniform_real_distribution<double> dist(0.5, 1);
    std::vector<double> q;
    int N = 100000000;
    for (i = 0; i < N; ++i) q.emplace_back(dist(rng));

    double sum = 0;

    auto start = std::chrono::steady_clock::now();
    for (i = 1; i < 100000000; ++i) {
        sum += q[i] + q[i - 1]; // change + to - or * or /, it takes same time.
    }

    auto end = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << std::endl;
    std::cout << sum << std::endl;

}

Addition and subtraction should be simple processes, perhaps a few shifts and bitwise operations whose cost is proportional to the precision.

Multiplication and division, on the other hand, are naturally more complicated. For multiplication, it seems natural for it to be an order of magnitude slower (something like O(n^2) if addition takes O(n), since multiplication can be broken down into additions of shifted values). Division should be harder still.

Yet for all four arithmetic operations on double values, this code takes ~110 ms with optimization enabled. How is this possible? What magic is going on here that allows C++ to handle multiplication as quickly as addition, or to handle addition as slowly as multiplication?

P.S. For integers, only division differs: it takes roughly twice as long.

Answer 1

On some processors, floating-point multiplication is as fast as addition because:

The hardware designers have put a lot of logic gates into the floating-point units.
Instructions may be divided into multiple stages which are executed in a pipeline. For example, a multiplication may perform part of its work in a unit M0, then pass the results to a unit M1 that does another part, then M2, then M3. While M1 is working on its part, M0 can start work on a different multiplication. With this arrangement, a multiplication may actually take four processor cycles to complete, but, because there are four units working on four stages, the processor can finish one multiplication every cycle. In contrast, a simpler instruction like XOR has just one stage.
Although some instructions could be executed quickly and some require more time, the entire processor is synchronized by a clock, and every pipeline stage in each execution unit has to finish its work in one clock cycle. This imposes some rigidity on the processor design: some simple operations will finish their work before a clock cycle is over, while complicated operations need the entire cycle. The designers make decisions about how long to make a clock cycle. If a clock cycle is too short (relative to the speed at which the logic gates work), then many instructions have to take multiple cycles and may require additional overhead to manage them. If a clock cycle is too long, then time is wasted needlessly waiting for instructions that could have completed sooner. For current processor technology, it is common that the floating-point multiplier stages work well with the processor cycle time.

Nonetheless, you may see differences between the times of addition and multiplication. Current processor designs are quite complicated, and processors typically have multiple units for doing various floating-point operations. A processor could have more units for doing addition than it does for doing multiplication, so it would be able to do more additions per unit of time than multiplications.

However, observe the expression you are using:

sum += q[i] + q[i - 1];

This causes sum to be serially dependent on its prior value. The processor can add q[i] to q[i-1] without waiting for prior additions, but then, to add to sum, it must wait for the prior add to sum to complete. This means that, if a processor has two units for addition, it could be working on both q[i] + q[i-1] and the prior addition to sum at the same time. But, if it had more addition units, it could not go any faster. It could use the extra units to do more of those q[i] + q[i - 1] additions for different values of i, but every addition to sum has to wait for the previous one. Therefore, with two or more addition units, this computation is dependent on the latency of addition, which is how long it takes to do a single addition. (This is in contrast to the throughput of addition, which is how many additions the processor can do per unit of time, if there is no serial dependency.)

If you used a different computation, such as sum += q[i]; or sum0 += q[i]; sum1 += q[i+1]; sum2 += q[i+2]; sum3 += q[i+3];, then you could see different times for addition and multiplication that depended on how many addition units and how many multiplication units the processor had.

Solution 1
2 · Accepted · 2019-10-12 10:45:05
