
The performance of iteration (cache misses)

I have found that iterating over a vector is faster when a std::vector<T>::iterator is used instead of an index variable (i) that counts up.

Thanks to a few comments, here is some additional information: (1) I use the Visual Studio C++ compiler; (2) I compiled in release mode with the optimization -O2 :)

[Image of the console output]

If the index variable i is incremented, the iteration takes 5875ms:

std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

or 5723ms:

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec2[i]->x = 0;
    vec2[i]->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

If std::vector<Data>::iterator is used to iterate, the iteration takes 29ms:

std::vector<Data> vec(MAX_DATA);

stopWatch.start();
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

or 110ms:

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (auto& it : vec2) {
    it->x = 0;
    it->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

Why is the other iteration so much faster?

I am surprised that the index-based iteration over data scattered across memory is as fast as the index-based iteration over data stored contiguously. Having the data next to each other in memory should reduce cache misses, and that seems to hold for the iteration with std::vector<Data>::iterator, so why not for the other one? Or am I mistaken, and the gap between 29 and 110ms is not caused by cache misses?

The entire program looks like this:

#include <iostream>
#include <chrono>
#include <vector>
#include <string>
#include <cstdlib>  // for system()

class StopWatch
{
public:
    void start() {
        this->t1 = std::chrono::high_resolution_clock::now();
    }

    void stop() {
        this->t2 = std::chrono::high_resolution_clock::now();
        this->diff = t2 - t1;
    }

    void printSpanAsMs(std::string startText = "time span") {
        long diffAsMs = std::chrono::duration_cast<std::chrono::milliseconds>
        (diff).count();
        std::cout << startText << ": " << diffAsMs << "ms" << std::endl;
    }
private:
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::chrono::high_resolution_clock::duration   diff;
} stopWatch;

struct Data {
    int x, y;
};

const unsigned long MAX_DATA = 20000000;

void test1()
{
    std::cout << "1. Test \n Use i to iterate through the vector" << 
    std::endl;

    std::vector<Data> vec(MAX_DATA);
    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec2[i]->x = 0;
        vec2[i]->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        delete vec2[i];
        vec2[i] = nullptr;
    }
}

void test2()
{
    std::cout << "2. Test \n Use std::vector<T>::iterator to iterate through the vector" <<
    std::endl;

    std::vector<Data> vec(MAX_DATA);

    stopWatch.start();
    for (auto& it : vec) {
        it.x = 0;
        it.y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (auto& it : vec2) {
        it->x = 0;
        it->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (auto& it : vec2) {
        delete it;
        it = nullptr;
    }
}

int main()
{
    test1();
    test2();

    system("PAUSE");
    return 0;
}

Why is the other iteration so much faster?

The reason is that MSVC 2017 cannot optimize the indexed loop properly.

In the first case it completely fails to optimize the loop:

for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}

Generated code (live demo):

        xor      r9d, r9d
        mov      eax, r9d
$LL4@test1:
        mov      rdx, QWORD PTR [rcx]
        lea      rax, QWORD PTR [rax+16]
        mov      DWORD PTR [rax+rdx-16], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-12], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-8], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-4], r9d
        sub      r8, 1
        jne      SHORT $LL4@test1

Replacing unsigned i with size_t i or hoisting the indexed access into a reference doesn't help (demo).
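
For reference, the two variants mentioned above presumably look something like the following. This is only a sketch: the linked demo is not reproduced here, so the exact code is an assumption, the function names fillWithSizeT and fillWithHoistedRef are made up for illustration, and Data is the struct from the question.

```cpp
#include <cassert>   // only needed for sanity checks when trying this out
#include <cstddef>
#include <vector>

struct Data {
    int x, y;
};

// Variant 1 (assumed shape): the same indexed loop, with std::size_t as the index type.
void fillWithSizeT(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
}

// Variant 2 (assumed shape): the indexed access is hoisted into a reference,
// so operator[] is called once per element instead of twice.
void fillWithHoistedRef(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        Data& d = vec[i];
        d.x = 0;
        d.y = 0;
    }
}

// Helper for checking the result: true if every element has x == 0 and y == 0.
bool allZero(const std::vector<Data>& vec) {
    for (const Data& d : vec) {
        if (d.x != 0 || d.y != 0) return false;
    }
    return true;
}
```

Both functions produce the same result as the original loop; the point of the answer is that neither rewrite led MSVC 2017 to generate better code.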

The only thing that helps is using an iterator, as you have already found out:

for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}

Generated code (live demo):

        xor      ecx, ecx
        npad     2
$LL4@test2:
        mov      QWORD PTR [rax], rcx
        add      rax, 8
        cmp      rax, rdx
        jne      SHORT $LL4@test2
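
The range-based for loop is shorthand for a loop over a pair of iterators. A rough hand-written equivalent is sketched below (the name fillWithIterators is illustrative, not from the original code). It keeps the begin and end iterators in local variables, which appears to correspond to the plain pointer increment and compare in the listing above.

```cpp
#include <cassert>   // only needed for sanity checks when trying this out
#include <vector>

struct Data {
    int x, y;
};

void fillWithIterators(std::vector<Data>& vec) {
    // begin() and end() are evaluated once, before the loop starts.
    auto first = vec.begin();
    auto last = vec.end();
    for (; first != last; ++first) {
        Data& it = *first;  // same role as "auto& it" in the range-based loop
        it.x = 0;
        it.y = 0;
    }
}
```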

clang just calls memset in both cases.
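
Written out by hand, that amounts to zeroing the whole buffer in a single call. The following is a sketch of the idea, not clang's actual output; zeroAll is a hypothetical name, and the approach is only valid because Data is trivially copyable and holds nothing but ints.

```cpp
#include <cassert>   // only needed for sanity checks when trying this out
#include <cstring>
#include <type_traits>
#include <vector>

struct Data {
    int x, y;
};

// memset is only safe on trivially copyable types.
static_assert(std::is_trivially_copyable<Data>::value,
              "Data must be trivially copyable to be zeroed with memset");

void zeroAll(std::vector<Data>& vec) {
    if (vec.empty()) return;  // data() may be null for an empty vector
    // All-zero bytes represent the value 0 for int, so this sets x and y to 0.
    std::memset(vec.data(), 0, vec.size() * sizeof(Data));
}
```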

The moral of the story: look at the generated code if you care about performance. Report issues to the vendor.
