迭代時的性能（緩存未命中）

Question

我發現迭代通過一個向量更快，而不是使用變量（i）來計算std::vector<T>::iterator被使用。

感謝一些評論，這里有一些額外的信息：（1）我使用Visual Studio C ++編譯器; （2）我在發布模式下編譯並使用優化-O2 :)

如果變量i遞增，則迭代進行

5875ms：

std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

或5723ms：

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec2[i]->x = 0;
    vec2[i]->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

如果使用std::vector<Data>::Iterator進行迭代，則迭代將采用

29ms：

std::vector<Data> vec(MAX_DATA);

stopWatch.start();
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

或110ms：

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (auto& it : vec2) {
    it->x = 0;
    it->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

為什么另一次迭代要快得多？

我想知道，數據位於存儲器中不同位置的變量i的迭代與使用變量i的迭代一樣快，其中數據並置在存儲器中。 數據在內存中彼此相鄰的事實應該減少緩存未命中，並且與使用std::vector<Data>::Iterator ，為什么不與另一個一起使用？ 或者我是否敢於和29到110毫秒的距離不是緩存錯失的債務？

整個程序看起來像這樣：

#include <iostream>
#include <chrono>
#include <vector>
#include <string>

class StopWatch
{
public:
    void start() {
        this->t1 = std::chrono::high_resolution_clock::now();
    }

    void stop() {
        this->t2 = std::chrono::high_resolution_clock::now();
        this->diff = t2 - t1;
    }

    void printSpanAsMs(std::string startText = "time span") {
        long diffAsMs = std::chrono::duration_cast<std::chrono::milliseconds>
        (diff).count();
        std::cout << startText << ": " << diffAsMs << "ms" << std::endl;
    }
private:
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::chrono::high_resolution_clock::duration   diff;
} stopWatch;

struct Data {
    int x, y;
};

const unsigned long MAX_DATA = 20000000;

void test1()
{
    std::cout << "1. Test \n Use i to iterate through the vector" << 
    std::endl;

    std::vector<Data> vec(MAX_DATA);
    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each 
    other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec2[i]->x = 0;
        vec2[i]->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        delete vec2[i];
        vec2[i] = nullptr;
    }
}

void test2()
{
    std::cout << "2. Test \n Use std::vector<T>::iteraror to iterate through 
    the vector" << std::endl;

    std::vector<Data> vec(MAX_DATA);

    stopWatch.start();
    for (auto& it : vec) {
        it.x = 0;
        it.y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each 
    other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (auto& it : vec2) {
        it->x = 0;
        it->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (auto& it : vec2) {
        delete it;
        it = nullptr;
    }
}

int main()
{
    test1();
    test2();

    system("PAUSE");
    return 0;
}

Answer 1

為什么另一次迭代要快得多？

原因是MSVC 2017無法正確優化它。

在第一種情況下，它完全無法優化循環：

for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}

生成的代碼（現場演示）：

        xor      r9d, r9d
        mov      eax, r9d
$LL4@test1:
        mov      rdx, QWORD PTR [rcx]
        lea      rax, QWORD PTR [rax+16]
        mov      DWORD PTR [rax+rdx-16], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-12], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-8], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-4], r9d
        sub      r8, 1
        jne      SHORT $LL4@test1

用size_t i替換unsigned i或將索引訪問提升為引用並沒有幫助（演示）。

唯一有用的是使用像你已經發現的迭代器：

for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}

生成的代碼（現場演示）：

        xor      ecx, ecx
        npad     2
$LL4@test2:
        mov      QWORD PTR [rax], rcx
        add      rax, 8
        cmp      rax, rdx
        jne      SHORT $LL4@test2

clang只是在兩種情況下調用memset 。

故事的寓意：如果你關心性能，請查看生成的代碼。 向供應商報告問題。

迭代時的性能（緩存未命中）

問題描述

1 個解決方案

解決方案1
1 已采納 2017-10-30 15:50:05

迭代時的性能（緩存未命中）

問題描述

1 個解決方案

解決方案1 1 已采納 2017-10-30 15:50:05

解決方案1
1 已采納 2017-10-30 15:50:05