How to measure the speed difference of for loops?

Question

I am curious about the following points regarding for loops:

for(auto) vs for(auto &)
Splitting one for loop into two
for(auto &) vs for(const auto &)
for(int : list) vs for(auto : list) [list is a vector of integers]

So I wrote the code below to test them, compiled as C++17.

There seems to be a difference in a CMake debug build (without optimization):

// In debug mode
1. elapsed: 7639 (1663305922550 - 1663305914911)
2. elapsed: 3841 (1663305926391 - 1663305922550)
3. elapsed: 3810 (1663305930201 - 1663305926391)

But in release mode (with gcc -O3) there is no difference between tests 1 to 3:

// release mode
1. elapsed: 0 (1663305408984 - 1663305408984)
2. elapsed: 0 (1663305409984 - 1663305409984)
3. elapsed: 0 (1663305410984 - 1663305410984)

I don't know whether my test method is wrong, or whether it is correct that there is no difference once optimization is enabled.

Here is my test source code.

// create test vector
const uint64_t max_ = 499999999;    // 499,999,999
std::vector<int>   v;
for (int i = 1; i < max_; i++)
    v.push_back(i);


// test 1.
auto start1 = getTick();
for (auto& e : v)
{
    auto t = e + 100;    t += 300;
}
for (auto& e : v)
{
    auto t = e + 200;    t += 300;
}
auto end1 = getTick();


// test 2.
// Omit tick function
for (auto& e : v)
{
    auto t1 = e + 100;    t1 += 300;
    auto t2 = e + 200;    t2 += 300;
}


// test 3.
for (auto e : v)
{
    auto t1 = e + 100;    t1 += 300;
    auto t2 = e + 200;    t2 += 300;
}

...

getTick() is implemented with chrono and returns milliseconds:

uint64_t getTick()
{
    return (duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count());
}

Also, this test was run on Debian aarch64:

Jetson Xavier NX (JetPack 4.6, Ubuntu 18.04 LTS)
8 GB RAM
GCC 7.5.0

Please let me know if anything is wrong. Thank you!

Answer 1

A loop whose results are never used can be optimized away, so your compiler is correctly doing that. But benchmarking with optimization disabled is not meaningful. C++ requires optimization to get the performance we expect for production use (especially with template library functions), and enabling optimization isn't a constant-factor speedup: without it, different ways of expressing the same logic lead to different asm, when in a normally optimized build they'd compile to the same asm.

You can't infer anything from a debug build about what's faster in a release build, at least not for small-scale micro-optimizations like this. See also "Idiomatic way of performance evaluation?"

With optimization enabled, copying to a local object can cost almost nothing if you only use one member of that copy, because the compiler can skip the rest of the copy. Get used to thinking about what real work actually needs to happen for the code; often a compiler will figure out what that minimum is. For example, auto & isn't actually going to put a pointer in a register and dereference it beyond what the loop was already doing to walk the array in the first place. The reference variable doesn't exist anywhere in the asm as a separate value in a register or memory.
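
To make that concrete, here is a small sketch of two loops that differ only in how the element is bound. Both function names are hypothetical and chosen for illustration; they do real work (a sum) so the loop cannot simply be deleted:

```cpp
#include <vector>

// Hypothetical comparison functions; neither appears in the original post.

// Binds each element by const reference.
long long sumByRef(const std::vector<int>& v)
{
    long long total = 0;
    for (const auto& e : v)
        total += e;
    return total;
}

// Binds each element by value (copies each int).
long long sumByValue(const std::vector<int>& v)
{
    long long total = 0;
    for (auto e : v)
        total += e;
    return total;
}
```

Compiling this with something like g++ -O3 -S and comparing the two function bodies is one way to see whether the compiler treats them differently. For a plain int element the expectation from the reasoning above is that they come out the same, but the compiler output is what settles it.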

So this isn't something you can isolate in a benchmark without some real work in the loop, e.g. summing an array or modifying every element. You could try using something like benchmark::DoNotOptimize (from Google Benchmark) or similar inline asm to make the compiler materialize a value in a register without doing anything else, but to be sure you're benchmarking exactly the right thing, you need to understand asm and check the compiler output. (Microbenchmarking is hard.) In which case you can probably already answer the question just by looking at the asm and seeing that it's the same either way in normal cases.
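
A rough sketch of the inline-asm idea, assuming GCC or Clang (GNU-style inline asm). keepAlive and sumAndKeep are hypothetical names; keepAlive is modelled on the idea behind benchmark::DoNotOptimize and is not that library's API:

```cpp
#include <vector>

// Hypothetical helper, assuming a GNU-compatible compiler.
// The empty asm statement claims to read `value`, so the compiler has to
// compute it, yet no instructions are emitted for the asm itself.
template <typename T>
inline void keepAlive(const T& value)
{
    asm volatile("" : : "r,m"(value) : "memory");
}

// Real work in the loop: sum every element, then force the result to exist
// so the loop cannot be discarded as dead code.
long long sumAndKeep(const std::vector<int>& v)
{
    long long total = 0;
    for (const auto& e : v)
        total += e;
    keepAlive(total);
    return total;
}
```

Even with a helper like this, the compiler remains free to vectorize or otherwise transform the loop, so checking the generated asm is still the way to confirm what is being timed.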

It's probably easier to just check which variants compile to the same asm with optimization enabled, instead of trying to guess whether small differences in experimental timing are due to noise or are real. (And if there is a difference, whether it's just a coincidence that one version was faster given the luck of the draw for code alignment and surrounding code on this particular CPU.)

Answer 2

-O3 enables compiler optimization, which removes this code. You can try declaring the variables used in the for loop as globals so that the compiler does not optimize them away.
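
A minimal sketch of this suggestion, with one change that is an assumption on top of the answer: with a plain global the compiler may still keep only the final store, so the sketch marks the global volatile. g_sink and touchEach are hypothetical names:

```cpp
#include <vector>

// Hypothetical global sink. `volatile` is an addition to the answer's idea:
// each write to it is an observable side effect the compiler must perform.
volatile int g_sink = 0;

void touchEach(const std::vector<int>& v)
{
    for (const auto& e : v)
    {
        int t = e + 100;
        t += 300;
        g_sink = t;  // one store per iteration; it cannot be dropped
    }
}
```

Note that the volatile store has a cost of its own, so a timing taken this way measures the store as well as the loop.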


Answer 1
0 2022-09-16 06:24:36

Answer 2
-1 2022-09-16 05:52:41
