
C++ thread overhead

I'm playing around with threads in C++, in particular using them to parallelize a map operation.

Here's the code:

#include <thread>
#include <iostream>
#include <cstdlib>
#include <vector>
#include <math.h>
#include <stdio.h>
#include <time.h>   // clock(), clock_gettime(), struct timespec, CLOCK_MONOTONIC

double multByTwo(double x){
  return x*2;
}

double doJunk(double x){
  return cos(pow(sin(x*2),3));
}

// apply ptr to each of the n elements of data, in place
template <typename T>
void map(T* data, int n, T (*ptr)(T)){
  for (int i=0; i<n; i++)
    data[i] = (*ptr)(data[i]);
}

// split data into NUMCORES contiguous chunks and run map on each chunk in its own thread
// (n is assumed to be divisible by NUMCORES; otherwise a few elements are left unprocessed)
template <typename T>
void parallelMap(T* data, int n, T (*ptr)(T)){
  int NUMCORES = 3;
  std::vector<std::thread> threads;
  for (int i=0; i<NUMCORES; i++)
    threads.push_back(std::thread(&map<T>, data + i*n/NUMCORES, n/NUMCORES, ptr));
  for (std::thread& t : threads)
    t.join();
}

int main()
{
  int n = 1000000000;
  double* nums = new double[n];
  for (int i=0; i<n; i++)
    nums[i] = i;

  std::cout<<"go"<<std::endl;

  clock_t c1 = clock();

  struct timespec start, finish;
  double elapsed;

  clock_gettime(CLOCK_MONOTONIC, &start);

  // also try with &doJunk
  //parallelMap(nums, n, &multByTwo);
  map(nums, n, &doJunk);

  std::cout << nums[342] << std::endl;

  clock_gettime(CLOCK_MONOTONIC, &finish);

  printf("CPU elapsed time is %f seconds\n", double(clock()-c1)/CLOCKS_PER_SEC);

  elapsed = (finish.tv_sec - start.tv_sec);
  elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;

  printf("Actual elapsed time is %f seconds\n", elapsed);
}

With multByTwo the parallel version is actually slightly slower (1.01 seconds versus 0.95 seconds real time), and with doJunk it's faster (51 versus 136 seconds real time). This implies to me that

  1. the parallelization is working, and
  2. there is a REALLY large overhead in creating new threads. Any thoughts as to why the overhead is so large, and how I can avoid it?

Just a guess: what you're likely seeing is that the multByTwo code is so fast that you're achieving memory saturation. The code will never run any faster no matter how much processor power you throw at it, because it's already going as fast as it can get the bits to and from RAM.
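As a rough sanity check of the memory-saturation theory, one can estimate the effective bandwidth the serial multByTwo run already achieves from the figures in the question: the array holds 10^9 doubles (8 GB), and the loop reads and writes each element once, so roughly 16 GB move through the memory bus in about 0.95 s, on the order of 17 GB/s, which is in the range where a typical dual-channel desktop memory system of that era starts to saturate. A minimal sketch of that back-of-the-envelope calculation (the timing is taken from the question, not re-measured):

#include <cstdio>

int main(){
  const double n_elems = 1e9;                            // array size used in the question
  const double bytes   = n_elems * sizeof(double) * 2;   // each element is read once and written once
  const double seconds = 0.95;                           // serial multByTwo wall time reported above
  std::printf("effective bandwidth: %.1f GB/s\n", bytes / seconds / 1e9);
}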

You did not specify the hardware you tested your program on, nor the compiler version and operating system. I did try your code on our four-socket Intel Xeon systems under 64-bit Scientific Linux with g++ 4.7 compiled from source.

First, on an older Xeon X7350 system, I got the following timings:

multByTwo with map

CPU elapsed time is 6.690000 seconds
Actual elapsed time is 6.691940 seconds

multByTwo with parallelMap on 3 cores

CPU elapsed time is 7.330000 seconds
Actual elapsed time is 2.480294 seconds

The parallel speedup is 2.7x.

doJunk with map

CPU elapsed time is 209.250000 seconds
Actual elapsed time is 209.289025 seconds

doJunk with parallelMap on 3 cores

CPU elapsed time is 220.770000 seconds
Actual elapsed time is 73.900960 seconds

The parallel speedup is 2.83x.

Note that the X7350 is from the quite old pre-Nehalem "Tigerton" family, with a front-side bus and a shared memory controller located in the north bridge. This is a pure SMP system with no NUMA effects.

Then I ran your code on a four-socket Intel X7550 system. These are Nehalem ("Beckton") Xeons with the memory controller integrated into the CPU, hence a 4-node NUMA system. Threads running on one socket and accessing memory located on another socket will run somewhat slower. The same is also true for a serial process that might get migrated to another socket by some stupid scheduler decision. Binding in such a system is very important, as you can see from the timings:

multByTwo with map

CPU elapsed time is 4.270000 seconds
Actual elapsed time is 4.264875 seconds

multByTwo with map bound to NUMA node 0

CPU elapsed time is 4.160000 seconds
Actual elapsed time is 4.160180 seconds

multByTwo with map bound to NUMA node 0 and CPU socket 1

CPU elapsed time is 5.910000 seconds
Actual elapsed time is 5.912319 seconds

multByTwo with parallelMap on 3 cores

CPU elapsed time is 7.530000 seconds
Actual elapsed time is 3.696616 seconds

Parallel speedup is only 1.13x (relative to the fastest node-bound serial execution: 4.16 s / 3.70 s ≈ 1.13). Now with binding:

multByTwo with parallelMap on 3 cores bound to NUMA node 0

CPU elapsed time is 4.630000 seconds
Actual elapsed time is 1.548102 seconds

Parallel speedup is 2.69x, as much as for the Tigerton CPUs.

multByTwo with parallelMap on 3 cores bound to NUMA node 0 and CPU socket 1

CPU elapsed time is 5.190000 seconds
Actual elapsed time is 1.760623 seconds

Parallel speedup is 2.36x, i.e. 88% of the previous case.

(I was too impatient to wait for the doJunk code to finish on the relatively slower Nehalems, but I would expect somewhat better parallel scaling, as in the Tigerton case.)

There is one caveat with NUMA binding, though. If you force binding to NUMA node 0 with, e.g., numactl --cpubind=0 --membind=0 ./program, this will limit memory allocation to that node only; on your particular system the memory attached to CPU 0 might not be enough, and a run-time failure will most likely occur.
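If restricting memory allocation with --membind is undesirable, a middle ground is to pin only the worker threads from inside the program and leave the memory policy alone. Below is a minimal, Linux-specific sketch of that idea using std::thread::native_handle() and pthread_setaffinity_np(); the CPU numbers are placeholders, since the mapping of logical CPU IDs to sockets and NUMA nodes is machine-specific (it can be inspected with numactl --hardware):

#include <pthread.h>   // pthread_setaffinity_np (GNU extension; g++ on Linux defines _GNU_SOURCE by default)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>

// Pin a std::thread to a single logical CPU; returns false if the call fails.
bool pin_to_cpu(std::thread& t, int cpu){
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set) == 0;
}

// Hypothetical usage inside parallelMap, right after each thread is created:
//   pin_to_cpu(threads.back(), first_cpu_of_socket + i);   // first_cpu_of_socket is machine-specific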

As you can see, there are factors other than the overhead of creating threads that can significantly influence your code's execution time. Also, on very fast systems the overhead can be too high compared to the computational work done by each thread. That's why, when asking questions about parallel performance, one should always include as many details as possible about the hardware and the environment used to measure the performance.

Multiple threads can only do more work in less time on a multi-core machine.

Otherwise they are just taking turns in a round-robin fashion.
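One small, concrete consequence: rather than hard-coding NUMCORES = 3, the worker count can be taken from std::thread::hardware_concurrency(), which reports how many threads the machine can actually run concurrently (it is allowed to return 0 when the value cannot be determined, so a fallback is needed). A minimal sketch:

#include <thread>

// Choose a worker count from the hardware; fall back to 1 if the query returns 0.
unsigned workerCount(){
  unsigned hw = std::thread::hardware_concurrency();   // may be 0 if unknown
  return hw != 0 ? hw : 1;
}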

Spawning new threads can be an expensive operation depending on the platform. The easiest way to avoid this overhead is to spawn a few threads at the launch of the program and have some sort of job queue. I believe std::async will do this for you.
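For illustration, here is a minimal sketch of the question's parallelMap rewritten on top of std::async; it reuses the map function template from the question. Passing std::launch::async forces each chunk to run asynchronously rather than deferred, but whether the implementation reuses threads from an internal pool is implementation-defined, so treat this as a possible mitigation rather than a guaranteed one. The chunk boundaries are computed in 64-bit arithmetic and cover the whole range, including any remainder:

#include <future>
#include <vector>

// Assumes the map() template from the question is in scope.
template <typename T>
void parallelMapAsync(T* data, int n, T (*ptr)(T), int chunks = 3){
  std::vector<std::future<void>> futures;
  for (int i = 0; i < chunks; i++){
    long long lo = (long long)i * n / chunks;         // 64-bit math avoids overflow for large n
    long long hi = (long long)(i + 1) * n / chunks;   // chunk boundaries cover all n elements
    futures.push_back(std::async(std::launch::async, &map<T>, data + lo, (int)(hi - lo), ptr));
  }
  for (auto& f : futures)
    f.get();   // wait for every chunk and propagate any exception
}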
