The performance at iteration (cache miss)
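I found that iterating through a vector is faster when a std::vector<T>::iterator is used instead of an index variable (i).
Thanks to some comments, here is some additional information: (1) I use the Visual Studio C++ compiler; (2) I compiled in Release mode with optimization -O2 :)
If the variable i is incremented, the iteration takes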
5875ms:
std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");
or 5723ms:
std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec2[i]->x = 0;
    vec2[i]->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");
If a std::vector<Data>::iterator is used to iterate, the iteration takes
29ms:
std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");
or 110ms:
std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());
stopWatch.start();
for (auto& it : vec2) {
    it->x = 0;
    it->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");
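Why is the other iteration so much faster?
I wonder why the iteration with the variable i over data located at different positions in memory is just as fast as the iteration with the variable i over data stored next to each other in memory. The fact that the data is adjacent in memory should reduce cache misses, and that does work with std::vector<Data>::iterator, so why not with the other approach? Or am I mistaken, and the gap between 29 ms and 110 ms is not caused by cache misses?
The whole program looks like this: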
#include <iostream>
#include <chrono>
#include <vector>
#include <string>
#include <cstdlib> // for system()
// Minimal stopwatch built on std::chrono.
class StopWatch
{
public:
    void start() {
        this->t1 = std::chrono::high_resolution_clock::now();
    }
    void stop() {
        this->t2 = std::chrono::high_resolution_clock::now();
        this->diff = t2 - t1;
    }
    // Prints the last measured span in milliseconds.
    void printSpanAsMs(std::string startText = "time span") {
        long long diffAsMs = std::chrono::duration_cast<std::chrono::milliseconds>(diff).count();
        std::cout << startText << ": " << diffAsMs << "ms" << std::endl;
    }
private:
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::chrono::high_resolution_clock::duration diff;
} stopWatch;
struct Data {
    int x, y;
};
const unsigned long MAX_DATA = 20000000;
void test1()
{
    std::cout << "1. Test \n Use i to iterate through the vector" << std::endl;
    // Elements stored contiguously inside the vector.
    std::vector<Data> vec(MAX_DATA);
    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");
    //////////////////////////////////////////////////
    // Elements allocated individually on the heap.
    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());
    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec2[i]->x = 0;
        vec2[i]->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        delete vec2[i];
        vec2[i] = nullptr;
    }
}
void test2()
{
    std::cout << "2. Test \n Use std::vector<T>::iterator to iterate through the vector" << std::endl;
    // Elements stored contiguously inside the vector.
    std::vector<Data> vec(MAX_DATA);
    stopWatch.start();
    for (auto& it : vec) {
        it.x = 0;
        it.y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");
    //////////////////////////////////////////////////
    // Elements allocated individually on the heap.
    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());
    stopWatch.start();
    for (auto& it : vec2) {
        it->x = 0;
        it->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");
    for (auto& it : vec2) {
        delete it;
        it = nullptr;
    }
}
int main()
{
    test1();
    test2();
    system("PAUSE");
    return 0;
}
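Why is the other iteration so much faster?
The reason is that MSVC 2017 fails to optimize it properly.
In the first case, it completely fails to optimize the loop: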
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}
Generated code (live demo):
xor r9d, r9d
mov eax, r9d
$LL4@test1:
mov rdx, QWORD PTR [rcx]
lea rax, QWORD PTR [rax+16]
mov DWORD PTR [rax+rdx-16], r9d
mov rdx, QWORD PTR [rcx]
mov DWORD PTR [rax+rdx-12], r9d
mov rdx, QWORD PTR [rcx]
mov DWORD PTR [rax+rdx-8], r9d
mov rdx, QWORD PTR [rcx]
mov DWORD PTR [rax+rdx-4], r9d
sub r8, 1
jne SHORT $LL4@test1
Replacing unsigned i with size_t i, or hoisting the indexed access into a reference, does not help (demo).
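For readers who cannot open the demo, the two variants mentioned above might look roughly like the sketch below. This is a reconstruction, not the code behind the linked demo: the function names zeroWithSizeT and zeroWithHoistedRef are hypothetical, and the loop bound uses vec.size() instead of MAX_DATA to keep the snippet self-contained.

```cpp
#include <cstddef>
#include <vector>

struct Data { int x, y; };

// Variant 1: the same indexed loop, but with a size_t index instead of unsigned.
void zeroWithSizeT(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
}

// Variant 2: the indexed access is hoisted into a reference once per element.
void zeroWithHoistedRef(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        Data& d = vec[i];
        d.x = 0;
        d.y = 0;
    }
}
```

Both functions do the same work as the loop in the question; they differ only in how the element is addressed.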
The only thing that helps is using an iterator, as you have already discovered:
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}
Generated code (live demo):
xor ecx, ecx
npad 2
$LL4@test2:
mov QWORD PTR [rax], rcx
add rax, 8
cmp rax, rdx
jne SHORT $LL4@test2
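Read back into C++, the listing above behaves like a plain pointer loop: the begin and end pointers are held in registers (rax and rdx), and because Data is two ints, the two assignments are merged into a single 8-byte store (mov QWORD PTR [rax], rcx). The indexed version, by contrast, reloads the vector's data pointer (mov rdx, QWORD PTR [rcx]) before every store. A minimal sketch of the equivalent source, where zeroRange is a hypothetical name and the static_assert records the assumption that Data has no padding:

```cpp
#include <vector>

struct Data { int x, y; };

// Assumption: Data is exactly two ints with no padding, so one 8-byte
// store can cover both members on a target with 4-byte int.
static_assert(sizeof(Data) == 2 * sizeof(int), "assumes no padding in Data");

// Pointer-based loop: begin and end are computed once, before the loop,
// which is also how a range-based for loop is defined to behave.
void zeroRange(std::vector<Data>& vec) {
    Data* p = vec.data();
    Data* const end = p + vec.size();
    for (; p != end; ++p) {
        p->x = 0;
        p->y = 0;
    }
}
```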
clang simply calls memset in both cases.
Moral of the story: if you care about performance, look at the generated code. Report issues to the vendor.
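On the cache-miss part of the question: objects allocated back to back with new often end up close together in memory (this depends on the allocator), so vec2 in the question is not necessarily "at a random position". One way to check how much access order matters is to visit the same elements in a shuffled order and time that loop instead. The sketch below is only an illustration under that assumption; shuffledPointers and zeroAll are hypothetical helper names, not part of the original program.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct Data { int x, y; };

// Returns pointers to every element of storage in a shuffled order.
// The elements stay where they are; only the visiting order changes.
std::vector<Data*> shuffledPointers(std::vector<Data>& storage, unsigned seed) {
    std::vector<Data*> ptrs;
    ptrs.reserve(storage.size());
    for (Data& d : storage)
        ptrs.push_back(&d);
    std::mt19937 rng(seed); // fixed seed keeps runs repeatable
    std::shuffle(ptrs.begin(), ptrs.end(), rng);
    return ptrs;
}

// The loop to time: same work as in the question, done in pointer order.
void zeroAll(const std::vector<Data*>& ptrs) {
    for (Data* p : ptrs) {
        p->x = 0;
        p->y = 0;
    }
}
```

To use it with the program from the question, build the pointer vector before stopWatch.start() and time only the call to zeroAll, then compare the result with the unshuffled iterator loop.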