The performance at iteration (cache miss)

I found that iterating through a vector is faster when a std::vector<T>::iterator is used than when an index variable (i) is used.

Thanks to some comments, here is some additional information: (1) I use the Visual Studio C++ compiler; (2) I compile in Release mode with optimization -O2 :)

(Screenshot of the console output)

If the iteration increments the variable i, it takes

5875ms:

std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

or 5723ms:

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec2[i]->x = 0;
    vec2[i]->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

If std::vector<Data>::iterator is used for the iteration, it takes

29ms:

std::vector<Data> vec(MAX_DATA);

stopWatch.start();
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

or 110ms:

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (auto& it : vec2) {
    it->x = 0;
    it->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

Why is the other iteration so much faster?

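I wonder why the iteration with the variable i over data scattered across memory is just as fast as the iteration with the variable i over data stored contiguously. Having the data adjacent in memory should reduce cache misses, and it does with the std::vector<Data>::iterator iteration, so why not with the other one? Or am I mistaken, and the gap between 29 and 110 ms is not due to cache misses?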

The entire program looks like this:

#include <iostream>
#include <chrono>
#include <cstdlib>   // system
#include <vector>
#include <string>

class StopWatch
{
public:
    void start() {
        this->t1 = std::chrono::high_resolution_clock::now();
    }

    void stop() {
        this->t2 = std::chrono::high_resolution_clock::now();
        this->diff = t2 - t1;
    }

    void printSpanAsMs(std::string startText = "time span") {
        const auto diffAsMs =
            std::chrono::duration_cast<std::chrono::milliseconds>(diff).count();
        std::cout << startText << ": " << diffAsMs << "ms" << std::endl;
    }
private:
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::chrono::high_resolution_clock::duration   diff;
} stopWatch;

struct Data {
    int x, y;
};

const unsigned long MAX_DATA = 20000000;

void test1()
{
    std::cout << "1. Test \n Use i to iterate through the vector" << std::endl;

    std::vector<Data> vec(MAX_DATA);
    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec2[i]->x = 0;
        vec2[i]->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        delete vec2[i];
        vec2[i] = nullptr;
    }
}

void test2()
{
    std::cout << "2. Test \n Use std::vector<T>::iterator to iterate through the vector" << std::endl;

    std::vector<Data> vec(MAX_DATA);

    stopWatch.start();
    for (auto& it : vec) {
        it.x = 0;
        it.y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (auto& it : vec2) {
        it->x = 0;
        it->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (auto& it : vec2) {
        delete it;
        it = nullptr;
    }
}

int main()
{
    test1();
    test2();

    system("PAUSE");
    return 0;
}

Why is the other iteration so much faster?

The reason is that MSVC 2017 cannot optimize it properly.

In the first case, it completely fails to optimize the loop:

for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}

Generated code (live demo):

        xor      r9d, r9d
        mov      eax, r9d
$LL4@test1:
        mov      rdx, QWORD PTR [rcx]
        lea      rax, QWORD PTR [rax+16]
        mov      DWORD PTR [rax+rdx-16], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-12], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-8], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-4], r9d
        sub      r8, 1
        jne      SHORT $LL4@test1

Replacing unsigned i with size_t i, or hoisting the indexed access into a reference, does not help (demo).
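For reference, the two rewrites mentioned above might look roughly like the sketch below. The function names zeroWithSizeT and zeroWithReference are made up for illustration, and the loop bound uses vec.size() rather than the MAX_DATA constant from the question. Both only change how the index is written. They still go through operator[] on every iteration.

```cpp
#include <cstddef>
#include <vector>

struct Data {
    int x, y;
};

// Variant A (hypothetical name): the loop from the question with a
// std::size_t index instead of unsigned.
void zeroWithSizeT(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
}

// Variant B (hypothetical name): bind the indexed element to a reference
// once per iteration and write through that reference.
void zeroWithReference(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        Data& d = vec[i];
        d.x = 0;
        d.y = 0;
    }
}
```

Both variants produce the same result as the original loop; according to the answer, neither changes the code MSVC 2017 generates.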

The only thing that helps is using iterators, as you have already found:

for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}

Generated code (live demo):

        xor      ecx, ecx
        npad     2
$LL4@test2:
        mov      QWORD PTR [rax], rcx
        add      rax, 8
        cmp      rax, rdx
        jne      SHORT $LL4@test2
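The shape of that assembly (one pointer advancing until it equals a fixed end pointer) matches what a range-based for loop means in the language: begin() and end() are each evaluated once, before the loop starts. A minimal sketch of the equivalent explicit iterator loop follows. The function name zeroWithIterators is made up for illustration.

```cpp
#include <vector>

struct Data {
    int x, y;
};

// Roughly what `for (auto& it : vec)` expands to: begin() and end() are
// evaluated once, and only the iterator changes inside the loop.
void zeroWithIterators(std::vector<Data>& vec) {
    auto first = vec.begin();
    const auto last = vec.end();
    for (; first != last; ++first) {
        Data& d = *first;
        d.x = 0;
        d.y = 0;
    }
}
```

By contrast, the indexed loop's assembly above reloads the vector's data pointer (mov rdx, QWORD PTR [rcx]) before every store.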

clang just calls memset in both cases.
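If the intent is simply to zero every element, it can also be stated directly with a standard algorithm, as in the sketch below. The function name zeroWithFill is made up for illustration. Whether a particular compiler lowers this to memset is not guaranteed and was not measured here, so the generated code still needs to be checked.

```cpp
#include <algorithm>
#include <vector>

struct Data {
    int x, y;
};

// Overwrite every element with a value-initialized Data, i.e. x == 0 and
// y == 0. How this is lowered (memset or a store loop) depends on the
// compiler and its flags.
void zeroWithFill(std::vector<Data>& vec) {
    std::fill(vec.begin(), vec.end(), Data{});
}
```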

Moral of the story: if you care about performance, look at the generated code. Report issues to the vendor.
