
How to split a program to fully utilize multiple CPUs, multiple cores and hyper-threading?

I have a bunch of commands to execute for gene sequencing. For example:

msclle_program -in 1.txt
msclle_program -in 2.txt
msclle_program -in 3.txt
      .........
msclle_program -in 10.txt

These commands are independent of each other.

The environment is a Linux desktop: Intel i7 (4 cores / 8 threads) × 7, 12 GB of memory.

I can split these commands into different n.sh programs and run them simultaneously.

My question is: how can I fully utilize multiple CPUs, multiple cores and hyper-threading to make the programs run faster?

More specifically, how many program files should I split the commands into?

My own understanding is:

  1. Split into 7 program files, so each CPU will run one program at 100%.
  2. Within one CPU, the CPU will utilize its multiple cores and threads on its own.

Is that true?

Many thanks for your comments.
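(For reference, the splitting into separate n.sh files described above can also be avoided entirely by letting a tool bound the parallelism. A minimal sketch, assuming GNU xargs is available and the input files are named as in the question:)

```shell
# Run the ten independent commands with at most 8 in flight at once
# (one per hardware thread); xargs launches the next job as soon as
# a running one finishes, so no manual splitting is needed.
seq 1 10 | xargs -n 1 -P 8 -I {} msclle_program -in {}.txt
```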

The answer is not simple or straightforward, and splitting the task into one program per CPU is likely to be non-optimal; it may indeed be poor, or even extremely poor.

First, as I understand it, you have seven quad-core CPUs (presumably there are eight, but you're saving one for the OS?). If you run a single-threaded process on each CPU, you will be using a single thread on a single core. The other three cores and all of the hyper-threads will not be used.

The hardware and OS cannot split a single thread over multiple cores.

You could, however, run four single-threaded processes per CPU (one per core), or even eight (one per hyper-thread). Whether or not this is optimal depends on the work being done by the processes; in particular, on their working-set size and memory-access patterns, and on the hardware cache arrangements: the number of levels of cache, their sizes, and their sharing. The NUMA arrangement of the cores also needs to be considered.
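To see the actual topology on a given machine before choosing a process count, the standard Linux tools suffice (the exact lscpu field names may vary slightly by distribution):

```shell
# How many logical CPUs (hardware threads) the OS sees:
nproc

# Sockets, cores per socket, and threads per core:
lscpu | grep -E 'Socket|Core|Thread'
```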

Basically speaking, an extra thread has to give you quite a bit of speed-up to outweigh what it can cost you in cache utilization, main-memory accesses and the disruption of pre-fetching.

Furthermore, because the effect of the working set exceeding certain caching limits is profound, what seems fine for, say, one or two cores may be appalling for four or eight, so you can't even experiment with one core and assume the results hold for eight.

Having a quick look, I see the i7 has a small L2 cache and a huge L3 cache. Given your data set, I wouldn't be surprised if there's a lot of data being processed. The question is whether or not it is processed sequentially (e.g. whether prefetching will be effective). If the data is not processed sequentially, you may do better by reducing the number of concurrent processes, so that their working sets tend to fit inside the L3 cache. I suspect that if you run eight or sixteen processes, the L3 cache will be hammered, i.e. overflowed. On the other hand, if your data access is non-sequential, the L3 cache probably isn't going to save you anyway.

You can spawn multiple processes and then assign each process to one CPU. You can use taskset -c to do this.

Keep a rolling number and increment it to specify the processor number.
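A sketch of that approach, assuming the input files from the question and 8 logical CPUs numbered 0–7 (adjust the modulus to your machine):

```shell
#!/bin/sh
# Pin each job to its own logical CPU, rolling the processor number.
cpu=0
for i in $(seq 1 10); do
    taskset -c "$cpu" msclle_program -in "$i.txt" &
    cpu=$(( (cpu + 1) % 8 ))    # wrap around after CPU 7
done
wait    # block until all pinned jobs have finished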

"Split into 7 program files. So each CPU will 100% run one program."

This is approximately correct: if you have 7 single-threaded programs and 7 processing units, then each of them has one thread to run. This is optimal: with fewer programs, some processing units would be idle; with more programs, time would be wasted alternating between them. Although, if you have 7 quad-core processors, the optimum number of threads (from a "CPU-bound perspective") would be 28. This is simplified, as in reality there will be other programs around sharing the CPU.

"With one CPU, the CPU will utilize its multi-core and multi-thread by its own."

No. Whether or not all the cores are in a single CPU makes little difference (it does make some difference in caching, though). In any case, the processor won't do any multithreading on its own; that's the programmer's job. That's why making programs faster has become very challenging nowadays: until about 2005 or so it was a free ride, as clock frequencies were steadily rising, but now that limit has been reached, and speeding up programs requires splitting them across a growing number of processing units. It's one of the major reasons for the renaissance of functional programming.

Why run them as separate processes? Consider running multiple threads in one process instead, which would both make the memory footprint much smaller and lower the amount of process scheduling required.

You could look at it this way (a bit over-simplified, but still):

Consider dividing up your work into processable units (PUs). You then want two or more cores to each process one PU at a time, such that they don't interfere with each other; the more cores, the more PUs you can process.

The work involved in processing one PU is input + processing + output (I+P+O). Since it is probably processing units from large in-memory structures containing perhaps millions of them or more, the input and output have mostly to do with memory. With one core this is not a problem, because no other core interferes with the memory accesses. With multiple cores, the problem basically moves to the nearest common resource, in this case the L3 cache, giving cache input (CI) and cache output (CO). With two cores you would want CI+CO to equal P/2 or less, because then the two cores could take turns accessing the nearest common resource (the L3 cache) and not interfere with each other. With three cores CI+CO would need to be P/3, and with four or eight cores you would need CI+CO to equal P/4 or P/8.

So the trick is to make the processing required for a PU reside completely inside a core and its own caches (L1 and L2). The more cores you have, the larger the PUs should be (in relation to the I/O required), such that each PU stays isolated inside its core as long as possible, with all the data it needs available in its local caches.

To sum it up, you want the cores to do as much meaningful and efficient processing as possible while impacting the L3 cache as little as possible, because the L3 cache is the bottleneck. Achieving such a balance is a challenge, but by no means impossible.

As you understand, cores executing "traditional" multi-threaded administrative or web applications (where no care whatsoever is taken to economize on L3 accesses) will constantly collide with each other for access to the L3 cache and resources further out. It is not uncommon for multi-threaded programs running on multiple cores to be slower than if they'd been running on single cores.

Also, don't forget that OS work impacts the cache (a lot) as well. If you divide the problem into separate processes (as I mentioned above), you'll be calling in the OS to referee much more often than is absolutely necessary.

My experience is that the existence of this problem, and its dos and don'ts, are mostly unknown or not understood.
