
How come multiplication is as fast as addition for C++ double type values?

#include <vector>
#include <iostream>
#include <random>
#include <chrono>

int main()
{
    std::mt19937 rng(std::chrono::system_clock::now().time_since_epoch().count());
    std::uniform_real_distribution<double> dist(0.5, 1);
    const int N = 100000000;
    std::vector<double> q;
    q.reserve(N);
    for (int i = 0; i < N; ++i) q.emplace_back(dist(rng));

    double sum = 0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 1; i < N; ++i) {
        sum += q[i] + q[i - 1]; // change + to - or * or /, it takes the same time
    }
    auto end = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << std::endl;
    std::cout << sum << std::endl;
}

Addition and subtraction should be simple processes: perhaps some shifts and bitwise operations, whose cost is proportional to the precision.

Multiplication and division, on the other hand, are naturally more complicated. For multiplication, it seems natural for it to be an order of magnitude slower: something like O(n^2) if addition takes O(n), since a multiplication can be broken down into additions of shifted values. Division should be harder still.

Yet for all four arithmetic operations on double values, this code takes ~110 ms with optimization enabled. How is this possible? What magic is going on that allows C++ to handle multiplication as quickly as addition... or to handle addition as slowly as multiplication?

PS: For integer types it takes about twice as long, but only for division.

On some processors, floating-point multiplication is as fast as addition because:

  • The hardware designers have put a lot of logic gates into the floating-point units.
  • Instructions may be divided into multiple stages which are executed in a pipeline. For example, a multiplication may perform part of its work in a unit M0, then pass the results to a unit M1 that does another part, then M2, then M3. While M1 is working on its part, M0 can start work on a different multiplication. With this arrangement, a multiplication may actually take four processor cycles to complete, but, because there are four units working on four stages, the processor can finish one multiplication every cycle. In contrast, a simpler instruction like XOR has just one stage.
  • Although some instructions could be executed quickly and some require more time, the entire processor is synchronized by a clock, and every pipeline stage in each execution unit has to finish its work in one clock cycle. This imposes some rigidity on the processor design: some simple operations will finish their work before a clock cycle is over, while complicated operations need the entire cycle. The designers make decisions about how long to make a clock cycle. If a clock cycle is too short (relative to the speed at which the logic gates work), then many instructions take multiple cycles and may require additional overhead to manage them. If a clock cycle is too long, then time is wasted needlessly waiting for instructions that could have completed sooner. For current processor technology, it is common that the floating-point multiplier stages fit well within the processor cycle time.

Nonetheless, you may see differences between the times of addition and multiplication. Current processor designs are quite complicated, and processors typically have multiple units for doing various floating-point operations. A processor could have more units for doing addition than it does for doing multiplication, so it would be able to do more additions per unit of time than multiplications.

However, observe the expression you are using:

sum += q[i] + q[i - 1];

This causes sum to be serially dependent on its prior value. The processor can add q[i] to q[i-1] without waiting for prior additions, but then, to add to sum, it must wait for the prior addition to sum to complete. This means that, if a processor has two units for addition, it could be working on both q[i] + q[i-1] and the prior addition to sum at the same time. But, if it had more addition units, it could not go any faster. It could use the extra units to do more of those q[i] + q[i-1] additions for different values of i, but every addition to sum has to wait for the previous one. Therefore, with two or more addition units, this computation is bound by the latency of addition, which is how long it takes to do a single addition. (This is in contrast to the throughput of addition, which is how many additions the processor can do per unit of time when there is no serial dependency.)

If you used a different computation, such as sum += q[i];, or an unrolled loop with independent accumulators such as sum0 += q[i]; sum1 += q[i+1]; sum2 += q[i+2]; sum3 += q[i+3];, then you could see different times for addition and multiplication, depending on how many addition units and how many multiplication units the processor has.
