Why can't I get any performance improvement by running multiple threads in C++11?

Question

I have the following test program with a simple function that finds primes, which I am trying to run in multiple threads (just as an example).

#include <cstdio>
#include <iostream>
#include <ctime>
#include <thread>

void primefinder(void)
{
   int n = 300000;

   int i, j;
   int lastprime = 0;
   for(i = 2; i <= n; i++) {
      for(j = 2; j <= i; j++) {
           if((i % j) == 0) {
               if(i == j)
                   lastprime = i;
               else {
                   break;
               }
           }
      }
   }

   std::cout << "Prime: " << lastprime << std::endl;
}

int main(void)
{
   std::clock_t start;
   start = std::clock();

   std::thread t1(primefinder);
   t1.join();

   std::cout << "Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;

   start = std::clock();

   std::thread t2(primefinder);
   std::thread t3(primefinder);
   t2.join();
   t3.join();

   std::cout << "Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
   return 0;
}

As shown, I run the function once in one thread and then once in each of two different threads. I compile it with g++ using -O3 and -pthread. I am running it on Linux Mint 18 with a Core i5-4670. I know it comes down to the OS, but I would very much expect these threads to run at least somewhat in parallel. When I run the program, top shows 100% CPU when using one thread and 200% CPU when using two threads. Despite this, the second run takes almost exactly twice as long.

The CPU is doing nothing else while running the program. Why doesn't this get executed in parallel?

Edit: I know both threads are doing the exact same thing. I chose the primefinder function simply as an example of something embarrassingly parallel, so when I run it in multiple threads it should take just as long in real time.

Answer 1

Using std::clock to time a parallel program in C++ is very deceptive. There are two types of time that we care about when timing a program: wall time and CPU time. Wall time is what we are all used to (think of a clock on a wall). CPU time is essentially how many CPU cycles your program used. std::clock returns CPU time (this is why you are dividing by CLOCKS_PER_SEC), and CPU time is only equal to wall time when there is one thread of execution. If a program can run 100% in parallel (like yours), then CPU time = (number of threads) * (wall time). So seeing almost exactly twice as long is exactly what you would expect.

For measuring wall time (which is what you want to do), I don't know of a way to do that using the STL. You can measure it using OpenMP or Boost:

omp_get_wtime()

Boost Timer

Since you are on Linux, the version of g++ that you are using more than likely has OpenMP support built in.

#include <omp.h>

const double startTime = omp_get_wtime();
// ... work goes here ...

const double time = omp_get_wtime() - startTime;

You will have to compile with -fopenmp.

EDIT:

As johnbakers pointed out, the chrono library does have a wall clock:

#include <chrono>
#include <iostream>

auto start = std::chrono::system_clock::now();
// ... do some work ...

auto end = std::chrono::system_clock::now();
std::chrono::duration<double> diff = end - start;  // elapsed wall time in seconds
std::cout << "Time: " << diff.count() << "(s)" << std::endl;

Output of that vs. Boost timer:

Boost: 121.685972s wall, 724.940000s user + 67.660000s system = 792.600000s CPU  (651.3%)
Chrono: 121.683(s)
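
To make the difference between the two clocks concrete, here is a minimal sketch that times the same work with both std::clock and std::chrono::steady_clock. The names last_prime_upto, Timing and time_threads are made up for this illustration, and last_prime_upto is only a stand-in for the question's primefinder. It assumes a POSIX-like system such as Linux, where std::clock reports CPU time summed over all threads of the process; other platforms may behave differently.

```cpp
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

// Hypothetical stand-in for primefinder(): CPU-bound trial division.
// Returns the largest prime <= n, or 0 if there is none.
int last_prime_upto(int n) {
    int last = 0;
    for (int i = 2; i <= n; ++i) {
        bool is_prime = true;
        for (int j = 2; j * j <= i; ++j) {
            if (i % j == 0) { is_prime = false; break; }
        }
        if (is_prime) last = i;
    }
    return last;
}

struct Timing {
    double cpu_ms;   // from std::clock: CPU time used by the whole process
    double wall_ms;  // from std::chrono::steady_clock: elapsed real time
};

// Runs the same work in `num_threads` threads and reads both clocks.
Timing time_threads(int num_threads, int n) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([n] {
            // The volatile store keeps the compiler from discarding the work.
            volatile int result = last_prime_upto(n);
            (void)result;
        });
    }
    for (auto& th : threads) th.join();

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    Timing r;
    r.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    r.wall_ms = std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    return r;
}
```

Calling time_threads(1, n) and then time_threads(2, n) on a machine with at least two idle cores should show cpu_ms roughly doubling while wall_ms stays about the same, which is the effect described above. Compile with -pthread as in the question.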

Answer 2

There's a pretty basic problem in your design, which explains why you don't see any benefit from threads.

When you have a search problem like this and you want to speed it up with parallelization, the idea is usually to use divide and conquer. You need to somehow divvy up the work so that, e.g., the first thread does the first half of the work and the second thread does the second half.

In your code, both threads call exactly the same function and they don't communicate: they each duplicate the other guy's work!

In some problems it's really easy to divvy up the work. For instance, if you were working on a SAT solver, one way to divide the work between two threads would be to choose a variable x_1 and have the first thread assume x_1 = true and check whether the formula is satisfiable, while the second thread assumes x_1 = false and does the same check. Then, when both are joined, you will know whether the formula is satisfiable overall, and no mutexes or inter-thread communication are needed.

In your primes problem it's a bit more complicated. You could try something like this: the first thread only considers candidates ending in 1, 3, 5 and the second thread considers candidates ending in 7, 9. For best performance, though, you probably want to use something like the Sieve of Eratosthenes, and I think that would be a little harder to parallelize. (You could maybe use an array of atomics?)
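
As a rough sketch of dividing the candidates between threads, the code below counts primes up to n with each thread testing its own share of the range. Note that it does not split by last digit as suggested above; it hands out fixed-size chunks round-robin instead, which is a swapped-in scheme chosen only because it is easy to write for any thread count. The names is_prime, count_primes_parallel and kChunk are hypothetical, and the chunk size is arbitrary.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

constexpr int kChunk = 1024;  // arbitrary chunk size, chosen for illustration

// Trial-division primality test (same idea as the question's inner loop).
bool is_prime(int x) {
    if (x < 2) return false;
    for (int j = 2; j * j <= x; ++j) {
        if (x % j == 0) return false;
    }
    return true;
}

// Counts primes in [2, n]. Thread t handles chunks t, t + T, t + 2T, ...
// where T is num_threads, so no candidate is tested twice.
// Assumes n is small enough that the index arithmetic cannot overflow int.
int count_primes_parallel(int n, int num_threads) {
    std::vector<int> partial(num_threads, 0);  // one slot per thread, so no mutex is needed
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&partial, n, t, num_threads] {
            int local = 0;  // accumulate locally, write the shared slot once
            for (int start = 2 + t * kChunk; start <= n; start += num_threads * kChunk) {
                const int stop = std::min(start + kChunk - 1, n);
                for (int i = start; i <= stop; ++i) {
                    if (is_prime(i)) ++local;
                }
            }
            partial[t] = local;
        });
    }
    for (auto& th : threads) th.join();

    int total = 0;
    for (int c : partial) total += c;
    return total;
}
```

Each thread writes only its own element of partial, and the results are combined after join, so the threads never need to communicate while running.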

Answer 3

There are many possible reasons why the code does not run in parallel.

In days of old, operating systems would run pieces of executables using a round-robin or priority scheme. One thread of execution could run for a couple of milliseconds and then be swapped out for another thread of execution. This would give the appearance that threads of execution run in parallel.

Threads of execution are also swapped out when one thread waits on a resource and the resource is not available.

In modern computers, with multiple processors or cores, the technique is still the same. The OS has another processor it can delegate tasks to. Core time is precious. The OS is unlikely to stop all the running tasks on multiple cores so your threads can execute in parallel. This probably means that the OS would have to wait for one processor to finish so your threads can be executed at the same time. That is most likely not going to happen.

However, many OSes have attributes that you can set to tell them to give your threads exclusive access to one or more cores. Since this is not standard C++ functionality and OSes are not all the same, you'll have to look up the API or any compiler-specific support.

Edit 1: Interrupts and other tasks
Keep in mind that your platform is not running your programs exclusively. Other tasks are lurking and may be executing while your program is running. Some examples include: virus checkers, things pinging the internet, and music players (at least on my machine).

These applications and interrupts may cause the OS to play round-robin with your threads on the same processor rather than running them in parallel. One scenario would be to have the music player on one processor while your program is running on another.

Answer 4

Take a look at the std::chrono namespace. It offers many C++11 utilities for measuring time that are superior to every suggestion here thus far.

For example, you can cast a clock reading to wall time.

Timing code is not trivial, but these tools do make it easier to explore along these lines.

4 answers

Answer 1: score 8, accepted, 2016-07-02 17:25:41

Answer 2: score 2, 2016-07-02 20:05:50

Answer 3: score 1, 2016-07-02 17:30:38

Answer 4: score 1, 2016-07-02 18:43:01
