Sieve of Eratosthenes C++ code speeds up on successive runs - why?

Question

I had a quick look around for this question but couldn't find an answer, although I guess it may have been asked here before.

I was messing around writing a simple implementation of the sieve of Eratosthenes in C++ and timing the result:

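#include <iostream>
#include <vector>

int main() {

  const int n = 100000;
  // Index k stands for the number k+1; the value -9 marks a composite.
  // std::vector replaces the variable-length array, which is not standard C++.
  std::vector<int> seive(n);

  for (int i=0; i<n; i++) {
    seive[i] = i;
  }

  // Cross out the multiples of every i up to and including sqrt(n).
  for (int i=2; i*i <= n; i++) {
    for (int j=i*2; j<=n; j+=i) {
      seive[j-1] = -9;
    }
  }

  // Print the survivors, starting at 2 (index 1) because 1 is not prime.
  for (int i=1; i<n; i++) {
    if (seive[i] != -9) {
      std::cout << i+1 << "\n";
    }
  }

  return 0;
}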

I compile it using:

g++ seive.cpp -o seiveCpp

And then time it using:

time ./seiveCpp

First time:

./seiveCpp  0.01s user 0.01s system 10% cpu 0.184 total

Second time:

./seiveCpp  0.01s user 0.01s system 58% cpu 0.034 total

Third time:

./seiveCpp  0.01s user 0.01s system 59% cpu 0.037 total

etc.

If I repeat this multiple times, the first run always seems to be around 5x slower than all the subsequent runs.

What is the reason behind this?

I am running this on a 2017 MacBook Pro with a 2.3 GHz Dual-Core Intel Core i5, and compiling with Apple clang version 11.0.0 (clang-1100.0.33.12).

Answer 1

The reason is the branch predictor. On the first run the computer doesn't know anything about the program, but while executing it, it learns the pattern of the jumps in your code (the for and if statements) and can then better predict which branch will be taken. Modern processors have long instruction pipelines, so predicting a jump correctly can significantly reduce the running time.

So, to compare a few algorithms by execution time, it is good practice to run each one a hundred times or so and take the smallest time.

Answer 2

Given the very large difference, I would guess that the CPU is in a lower performance mode when you start the first run, but under the load from the first run the OS switches it into a higher performance mode, which you observe as a shorter execution time.

If you want to avoid the effect, make sure your notebook is connected to AC power and that all power-saving options are disabled.

In any case there will still be caching effects left (e.g. the contents of the executable might be cached in memory). But these shouldn't be on the order of 100 ms, I think.

In general, when you benchmark code you should always do some warm-up runs, because there will always be such effects to some degree for one reason or another. You generally want to perform the actual test runs once the environment has reached an equilibrium state, so to speak.

Answer 3

When a program is run multiple times, the first time the OS has to load the file into memory; the next time it is likely to already be present (although relocations may still be necessary depending on compiler/linker settings, namely whether position-independent code is generated). The branch prediction answer would be much more likely to apply if you were running the same code many times within a single process. That is a good idea when gathering performance data: put the timing code in your program and run the code of interest multiple times, timing each loop, rather than running your entire program multiple times and using an external time program.

3 answers

Answer 1
3 Accepted 2020-03-28 21:02:44

Answer 2
2 2020-03-28 21:05:57

Answer 3
1 2020-03-28 21:20:09
