I have found that iterating over a vector is faster when std::vector<T>::iterator is used instead of an index variable (i) that counts up.
Thanks to a few comments, here is some additional information: (1) I use the Visual Studio C++ compiler; (2) I compiled in release mode with optimization -O2 :)
If the variable i is incremented, the iteration takes 5875ms:
std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
vec[i].x = 0;
vec[i].y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");
or 5723ms:
std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
vec2.push_back(new Data());
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
vec2[i]->x = 0;
vec2[i]->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");
If std::vector<Data>::iterator is used to iterate (here via a range-based for loop), the iteration takes 29ms:
std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (auto& it : vec) {
it.x = 0;
it.y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");
or 110ms:
std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
vec2.push_back(new Data());
stopWatch.start();
for (auto& it : vec2) {
it->x = 0;
it->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");
Why is the other iteration so much faster?
I'm surprised that the index-based iteration over data scattered across memory is as fast as the index-based iteration over data that is contiguous in memory. Contiguous data should reduce cache misses, and that seems to work for the iteration with std::vector<Data>::iterator, so why not for the index-based one? Or am I mistaken, and the difference between 29ms and 110ms is not caused by cache misses?
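One caveat about the pointer test: consecutive calls to new often return addresses that are close together and in increasing order, so vec2 may not really be visited at "random" positions. A minimal sketch, assuming the goal is to force a non-sequential access order, is to shuffle the pointers before timing. The helper names makeShuffledPointers and freeAll and the fixed seed are assumptions for illustration, not part of the original program.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct Data {
    int x, y;
};

// Hypothetical helper: allocates `count` Data objects and returns the
// pointers in shuffled order, so a linear walk over the vector jumps
// around in memory instead of following allocation order.
std::vector<Data*> makeShuffledPointers(std::size_t count, unsigned seed = 42U) {
    std::vector<Data*> ptrs;
    ptrs.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        ptrs.push_back(new Data());
    std::mt19937 rng(seed);  // fixed seed keeps runs comparable
    std::shuffle(ptrs.begin(), ptrs.end(), rng);
    return ptrs;
}

// Frees everything allocated by makeShuffledPointers.
void freeAll(std::vector<Data*>& ptrs) {
    for (Data*& p : ptrs) {
        delete p;
        p = nullptr;
    }
}
```

Timing the loop over the shuffled vector instead of the freshly filled vec2 would show whether the 29ms versus 110ms gap grows once the access order is no longer sequential.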
The entire program looks like this:
#include <iostream>
#include <chrono>
#include <vector>
#include <string>
#include <cstdlib> // for system()
class StopWatch
{
public:
void start() {
this->t1 = std::chrono::high_resolution_clock::now();
}
void stop() {
this->t2 = std::chrono::high_resolution_clock::now();
this->diff = t2 - t1;
}
void printSpanAsMs(std::string startText = "time span") {
long diffAsMs = std::chrono::duration_cast<std::chrono::milliseconds>
(diff).count();
std::cout << startText << ": " << diffAsMs << "ms" << std::endl;
}
private:
std::chrono::high_resolution_clock::time_point t1, t2;
std::chrono::high_resolution_clock::duration diff;
} stopWatch;
struct Data {
int x, y;
};
const unsigned long MAX_DATA = 20000000;
void test1()
{
std::cout << "1. Test \n Use i to iterate through the vector" <<
std::endl;
std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
vec[i].x = 0;
vec[i].y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");
//////////////////////////////////////////////////
std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
vec2.push_back(new Data());
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
vec2[i]->x = 0;
vec2[i]->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");
for (unsigned i = 0U; i < MAX_DATA; ++i) {
delete vec2[i];
vec2[i] = nullptr;
}
}
void test2()
{
std::cout << "2. Test \n Use std::vector<T>::iterator to iterate through the vector" << std::endl;
std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (auto& it : vec) {
it.x = 0;
it.y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");
//////////////////////////////////////////////////
std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
vec2.push_back(new Data());
stopWatch.start();
for (auto& it : vec2) {
it->x = 0;
it->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");
for (auto& it : vec2) {
delete it;
it = nullptr;
}
}
int main()
{
test1();
test2();
system("PAUSE");
return 0;
}
Why is the other iteration so much faster?
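The reason is that MSVC 2017 cannot optimize the index-based loop properly.
In the first case it completely fails to optimize the loop. Note that the generated code reloads a pointer from [rcx] (presumably the vector's data pointer) before every single store: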
for (unsigned i = 0U; i < MAX_DATA; ++i) {
vec[i].x = 0;
vec[i].y = 0;
}
Generated code (live demo):
xor r9d, r9d
mov eax, r9d
$LL4@test1:
mov rdx, QWORD PTR [rcx]
lea rax, QWORD PTR [rax+16]
mov DWORD PTR [rax+rdx-16], r9d
mov rdx, QWORD PTR [rcx]
mov DWORD PTR [rax+rdx-12], r9d
mov rdx, QWORD PTR [rcx]
mov DWORD PTR [rax+rdx-8], r9d
mov rdx, QWORD PTR [rcx]
mov DWORD PTR [rax+rdx-4], r9d
sub r8, 1
jne SHORT $LL4@test1
Replacing unsigned i with size_t i or hoisting the indexed access into a reference doesn't help (demo).
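The demo link is not reproduced here, so the following is only a sketch of what those two variants might look like. The function names are hypothetical, and vec.size() is used as the loop bound instead of MAX_DATA so that the functions stand alone. According to the statement above, MSVC 2017 does not produce better code for either of them.

```cpp
#include <cstddef>
#include <vector>

struct Data {
    int x, y;
};

// Variant 1: std::size_t index instead of unsigned.
void zeroWithSizeT(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
}

// Variant 2: the indexed access is hoisted into a reference,
// so operator[] is called once per element instead of twice.
void zeroWithHoistedRef(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        Data& d = vec[i];
        d.x = 0;
        d.y = 0;
    }
}
```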
The only thing that helps is using an iterator, as you have already found out:
for (auto& it : vec) {
it.x = 0;
it.y = 0;
}
Generated code (live demo):
xor ecx, ecx
npad 2
$LL4@test2:
mov QWORD PTR [rax], rcx
add rax, 8
cmp rax, rdx
jne SHORT $LL4@test2
clang just calls memset in both cases.
The moral of the story: look at the generated code if you care about performance. Report issues to the vendor.
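To make the generated code easy to find, it can help to isolate each loop in its own small function and ask the compiler for an assembly listing (for example /FA with MSVC, or -S with clang and GCC). The sketch below does that for the two loops from the question. The function names are hypothetical, and the std::fill variant is an addition that does not appear in the original post; whether a given compiler turns any of these into a memset has to be checked in the listing, not assumed.

```cpp
#include <algorithm>
#include <vector>

struct Data {
    int x, y;
};

// Index-based loop, as in test1 of the question.
void zeroIndexed(std::vector<Data>& vec) {
    for (unsigned i = 0U; i < vec.size(); ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
}

// Range-based for loop, as in test2 of the question.
void zeroRangeFor(std::vector<Data>& vec) {
    for (auto& it : vec) {
        it.x = 0;
        it.y = 0;
    }
}

// Not in the original: std::fill states the intent directly.
void zeroFill(std::vector<Data>& vec) {
    std::fill(vec.begin(), vec.end(), Data{0, 0});
}
```

All three functions leave the vector in the same state, so any difference in the listing is purely a matter of what the optimizer managed to do with each form.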