#include <vector>
#include <iostream>
#include <random>
#include <chrono>

int main()
{
    std::mt19937 rng(std::chrono::system_clock::now().time_since_epoch().count());
    std::uniform_real_distribution<double> dist(0.5, 1.0);

    const int N = 100000000;
    std::vector<double> q;
    q.reserve(N);
    for (int i = 0; i < N; ++i) q.emplace_back(dist(rng));

    double sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 1; i < N; ++i) {
        sum += q[i] + q[i - 1]; // change + to - or * or /, it takes the same time
    }
    auto end = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << std::endl;
    std::cout << sum << std::endl;
}
Addition and subtraction should be simple operations, perhaps some shifts and bitwise operations whose cost is proportional to the precision. Multiplication and division are naturally more complicated. For multiplication, it seems natural for it to be much slower (something like O(n^2) if addition takes O(n), since multiplication can be broken down into additions of shifted values); division should be harder still.
Yet for all four arithmetic operations on double values, this code takes ~110 ms with optimization enabled. How is this possible? What magic is going on that lets C++ handle multiplication as quickly as addition... or handle addition as slowly as multiplication?
PS: for integers it takes about twice as long, but only for division.
On some processors, floating-point multiplication is as fast as addition because the hardware dedicates comparable resources to both operations: the multiplier is fully pipelined, and its latency is similar to the adder's (a few cycles on typical modern designs).
Nonetheless, you may see differences between the times of addition and multiplication. Current processor designs are quite complicated, and processors typically have multiple units for doing various floating-point operations. A processor could have more units for doing addition than it does for doing multiplication, so it would be able to do more additions per unit of time than multiplications.
However, observe the expression you are using:

sum += q[i] + q[i - 1];

This causes sum to be serially dependent on its prior value. The processor can add q[i] to q[i - 1] without waiting for prior additions, but then, to add to sum, it must wait for the prior add to sum to complete. This means that, if a processor has two units for addition, it could be working on both q[i] + q[i - 1] and the prior addition to sum at the same time. But if it had more addition units, it could not go any faster. It could use the extra units to do more of those q[i] + q[i - 1] additions for different values of i, but every addition to sum has to wait for the previous one. Therefore, with two or more addition units, this computation is dependent on the latency of addition, which is how long it takes to do a single addition. (This is in contrast to the throughput of addition, which is how many additions the processor can do per unit of time when there is no serial dependency.)
If you used a different computation, such as sum += q[i]; or

sum0 += q[i]; sum1 += q[i+1]; sum2 += q[i+2]; sum3 += q[i+3];

then you could see different times for addition and multiplication, depending on how many addition units and how many multiplication units the processor has.