
How to find ideal number of parallel processes to run with python multiprocessing?

Trying to find out the correct number of parallel processes to run with python multiprocessing.

Scripts below are run on an 8-core, 32 GB (Ubuntu 18.04) machine. (Only system processes and basic user processes were running while the tests below were performed.)

Tested multiprocessing.Pool and apply_async with the following:

from multiprocessing import current_process, Pool, cpu_count
from datetime import datetime
import time

num_processes = 1 # vary this

print(f"Starting at {datetime.now()}")
start = time.perf_counter()

print(f"# CPUs = {cpu_count()}") # 8
num_procs = 5 * cpu_count() # 40


def cpu_heavy_fn():
    s = time.perf_counter()
    print(f"{datetime.now()}: {current_process().name}")
    x = 1
    for i in range(1, int(1e7)):
        x = x * i
        x = x / i
    t_taken = round(time.perf_counter() - s, 2)
    return t_taken, current_process().name


pool = Pool(processes=num_processes)

multiple_results = [pool.apply_async(cpu_heavy_fn, ()) for i in range(num_procs)]
results = [res.get() for res in multiple_results]
for r in results:
    print(r[0], r[1])

print(f"Done at {datetime.now()}")
print(f"Time taken = {time.perf_counter() - start}s")

Here are the results:

num_processes   total_time_taken (s)
 1              28.25
 2              14.28
 3              10.2
 4               7.35
 5               7.89
 6               8.03
 7               8.41
 8               8.72
 9               8.75
16               8.7
40               9.53

The following make sense to me:

  • Running one process at a time takes about 0.7 seconds for each task, so running 40 should take about 28s, which agrees with what we observe above. (A back-of-the-envelope model of this scaling is sketched just after this list.)
  • Running 2 processes at a time should halve the time, and this is observed above (~14s).
  • Running 4 processes at a time should halve the time again, and this is observed above (~7s).
  • Increasing parallelism beyond the number of cores (8) should degrade performance (due to CPU contention), and this is observed (sort of).
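
As a quick sanity check of that reasoning, here is a minimal back-of-the-envelope sketch (my own illustration, assuming a fixed ~0.7 s of CPU time per task and zero pool overhead or contention) of what ideal scaling would predict:

from math import ceil

# Ideal-scaling model (assumption: ~0.7 s per task, fully independent tasks,
# no pool overhead, no CPU contention).
PER_TASK_S = 0.7   # approximate single-process cost of one cpu_heavy_fn() call
NUM_TASKS = 40     # num_procs in the script above

for workers in (1, 2, 4, 8):
    # with W workers the 40 tasks run in ceil(40 / W) back-to-back "waves"
    expected = ceil(NUM_TASKS / workers) * PER_TASK_S
    print(f"{workers} workers -> ~{expected:.1f}s expected")

This predicts ~28s, ~14s, ~7s and ~3.5s respectively - the first three match the table, and the last one is exactly the missing ~3.5s asked about below.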

What doesn't make sense is:

  • Why is running 8 in parallel not twice as fast as running 4 in parallel, i.e., why is it not ~3.5s?
  • Why is running 5 to 8 in parallel at a time worse than running 4 at a time? There are 8 cores, so why is the overall run time still worse? (When running 8 in parallel, htop showed all CPUs at near 100% utilization. When running 4 in parallel, only 4 of them were at 100%, which makes sense.)

Q : " Why is running 5 to 8 in parallel at a time worse than running 4 at a time?"为什么一次并行运行 5 到 8 个比一次运行 4 个更糟糕?”

Well, there are several reasons, and we will start from a static, most easily observable one:

The silicon design (for which a few hardware tricks were used) does not scale beyond 4 truly independent cores.

So the last +1 increase in processor count that still delivers the Amdahl's Law-style speedup is the step up to 4, and any further +1 will not scale the performance in the same way observed in the { 2, 3, 4 } case:
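
For reference, a small sketch of the classic Amdahl's Law formula, S(N) = 1 / ((1 - p) + p/N); the parallel fractions p used here are purely illustrative, not values measured from the post:

# Amdahl's Law: speedup with N processors for a workload whose parallelisable
# fraction is p. The p values below are illustrative assumptions only.
def amdahl_speedup(n, p):
    return 1.0 / ((1.0 - p) + p / n)

for p in (1.0, 0.95):
    print(p, [round(amdahl_speedup(n, p), 2) for n in (1, 2, 4, 8)])

With p close to 1 (the tasks here are fully independent), the formula alone would predict near-linear speedup, so the plateau at 4 has to come from the silicon itself - which the topology map below illustrates.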

This lstopo CPU-topology map helps to start decoding WHY (shown here for a 4-core chip, but the logic is the same for your 8-core silicon - run lstopo on your device to see more details in vivo):

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Machine (31876MB)                                                                                                 │
│                                                                                                                   │
│ ┌────────────────────────────────────────────────────────────┐                      ┌───────────────────────────┐ │
│ │ Package P#0                                                │  ├┤╶─┬─────┼┤╶───────┤ PCI 10ae:1F44             │ │
│ │                                                            │      │               │                           │ │
│ │ ┌────────────────────────────────────────────────────────┐ │      │               │ ┌────────────┐  ┌───────┐ │ │
│ │ │ L3 (8192KB)                                            │ │      │               │ │ renderD128 │  │ card0 │ │ │
│ │ └────────────────────────────────────────────────────────┘ │      │               │ └────────────┘  └───────┘ │ │
│ │                                                            │      │               │                           │ │
│ │ ┌──────────────────────────┐  ┌──────────────────────────┐ │      │               │ ┌────────────┐            │ │
│ │ │ L2 (2048KB)              │  │ L2 (2048KB)              │ │      │               │ │ controlD64 │            │ │
│ │ └──────────────────────────┘  └──────────────────────────┘ │      │               │ └────────────┘            │ │
│ │                                                            │      │               └───────────────────────────┘ │
│ │ ┌──────────────────────────┐  ┌──────────────────────────┐ │      │                                             │
│ │ │ L1i (64KB)               │  │ L1i (64KB)               │ │      │               ┌───────────────┐             │
│ │ └──────────────────────────┘  └──────────────────────────┘ │      ├─────┼┤╶───────┤ PCI 10bc:8268 │             │
│ │                                                            │      │               │               │             │
│ │ ┌────────────┐┌────────────┐  ┌────────────┐┌────────────┐ │      │               │ ┌────────┐    │             │
│ │ │ L1d (16KB) ││ L1d (16KB) │  │ L1d (16KB) ││ L1d (16KB) │ │      │               │ │ enp2s0 │    │             │
│ │ └────────────┘└────────────┘  └────────────┘└────────────┘ │      │               │ └────────┘    │             │
│ │                                                            │      │               └───────────────┘             │
│ │ ┌────────────┐┌────────────┐  ┌────────────┐┌────────────┐ │      │                                             │
│ │ │ Core P#0   ││ Core P#1   │  │ Core P#2   ││ Core P#3   │ │      │     ┌──────────────────┐                    │
│ │ │            ││            │  │            ││            │ │      ├─────┤ PCI 1002:4790    │                    │
│ │ │ ┌────────┐ ││ ┌────────┐ │  │ ┌────────┐ ││ ┌────────┐ │ │      │     │                  │                    │
│ │ │ │ PU P#0 │ ││ │ PU P#1 │ │  │ │ PU P#2 │ ││ │ PU P#3 │ │ │      │     │ ┌─────┐  ┌─────┐ │                    │
│ │ │ └────────┘ ││ └────────┘ │  │ └────────┘ ││ └────────┘ │ │      │     │ │ sr0 │  │ sda │ │                    │
│ │ └────────────┘└────────────┘  └────────────┘└────────────┘ │      │     │ └─────┘  └─────┘ │                    │
│ └────────────────────────────────────────────────────────────┘      │     └──────────────────┘                    │
│                                                                     │                                             │
│                                                                     │     ┌───────────────┐                       │
│                                                                     └─────┤ PCI 1002:479c │                       │
│                                                                           └───────────────┘                       │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

A closer look, like the one from a call to the hwloc tool lstopo-no-graphics -.ascii, shows where mutual processing independence ends - here at the level of the shared L1 instruction cache (the L3 cache is shared too, yet it sits at the top of the hierarchy and is of such a size that it only bothers large problem solvers, not our case).
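
To check the same thing programmatically, a minimal Linux-only sketch (my own addition; it reads sysfs directly, and on other systems psutil.cpu_count(logical=False) gives a similar answer) can group logical CPUs by the physical core they belong to:

# Group logical CPUs (cpu0, cpu1, ...) by their physical core, so SMT siblings
# become visible. Linux-only: reads /sys/devices/system/cpu/*/topology.
from collections import defaultdict
from pathlib import Path

siblings = defaultdict(list)
for topo in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*/topology")):
    cpu = topo.parent.name                                  # e.g. "cpu0"
    core = (topo / "core_id").read_text().strip()
    pkg = (topo / "physical_package_id").read_text().strip()
    siblings[(pkg, core)].append(cpu)

print(f"physical cores: {len(siblings)}")
for (pkg, core), cpus in sorted(siblings.items()):
    print(f"  package {pkg}, core {core}: {cpus}")

On an SMT machine each physical core will list two logical CPUs, which is exactly the kind of sharing the topology map above exposes.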


Next comes a worse, observable reason WHY it gets even worse with 8 processes:

Q : "Why does running 8 in parallel not twice as fast as running 4 in parallel ie why is it not ~3.5s ?"“为什么并行运行 8 的速度没有并行运行 4 的两倍,即为什么不是~3.5s ?”

Because of thermal management.


The more work is loaded onto the CPU cores, the more heat is produced from driving electrons at ~3.5+ GHz through the silicon maze. Thermal constraints are what prevent any further boost in CPU computing power, simply because the laws of physics, as we know them, do not permit growth beyond some material-defined limits.

So what comes next?
The CPU design has circumvented not the physics (that is impossible), but us, the users - by promising us a CPU chip running at ~3.5+ GHz. In fact, the CPU can use this clock rate only for short periods of time - until the dissipated heat gets the silicon close to its thermal limits. Then the CPU will either reduce its own clock rate as a defensive step against overheating (which reduces performance, doesn't it?), or some CPU micro-architectures may hop (move a flow of processing) onto another, free and thus cooler CPU core (which keeps the promise of a higher clock rate there, at least for some small amount of time), yet this also reduces performance, as the hop does not occur in zero time and does not happen at zero cost (cache losses, re-fetches, etc.).
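
If you want to watch the throttling happen in vivo, a minimal Linux-only sketch like the one below (my own addition; availability of scaling_cur_freq depends on the cpufreq driver) can sample per-core clock rates while the pool is busy:

# Sample per-core clock rates from cpufreq sysfs; throttled or "hopped" cores
# show up as dropping / uneven frequencies. Linux-only sketch.
import time
from pathlib import Path

def sample_freqs_mhz():
    freqs = {}
    for f in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        cpu = f.parts[-3]                        # e.g. "cpu3"
        freqs[cpu] = int(f.read_text()) // 1000  # kHz -> MHz
    return freqs

if __name__ == "__main__":
    for _ in range(5):                           # run this alongside the Pool workload
        print(sorted(sample_freqs_mhz().items()))
        time.sleep(2)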

This picture shows a snapshot of the core-hopping case - cores 0-19 got too hot and are under the Thermal Throttling cap, while cores 20-39 can (at least for now) run at full speed:

[figure: per-core clock rates - cores 0-19 thermally throttled, cores 20-39 at full speed]


The Result?

The thermal constraints apply here too (dunking the CPU into a pool of liquid nitrogen was demonstrated for a "popular" magazine show, yet it is not a reasonable option for any sustainable computing, as the mechanical stress of going from a deep-frozen state into a 6+ GHz, steam-forming super-heater cracks the body of the CPU and results in CPU death from cracks and mechanical fatigue within but a few workload episodes - a no-go zone, due to negative ROI for any serious project).

Good cooling and right-sizing of the pool-of-workers, based on in-vivo pre-testing, is the only sure bet here.
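
A minimal sketch of such in-vivo pre-testing (my own illustration, reusing the cpu_heavy_fn workload from the question) could look like this - time the same batch with each candidate pool size and keep the fastest:

# "In-vivo" pre-test: run the same CPU-bound batch with each candidate pool
# size and keep whichever finishes fastest on this particular machine.
import time
from multiprocessing import Pool, cpu_count

def cpu_heavy_fn(_):
    # same busy-loop as in the question, with a dummy argument for pool.map()
    x = 1
    for i in range(1, int(1e7)):
        x = x * i
        x = x / i
    return x

if __name__ == "__main__":
    num_tasks = 5 * cpu_count()
    timings = {}
    for workers in range(1, cpu_count() + 1):
        start = time.perf_counter()
        with Pool(processes=workers) as pool:
            pool.map(cpu_heavy_fn, range(num_tasks))
        timings[workers] = round(time.perf_counter() - start, 2)
        print(f"{workers:2d} workers: {timings[workers]}s")
    best = min(timings, key=timings.get)
    print(f"Best pool size on this machine: {best}")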

Other architecture :

[figure: lstopo topology map of a different CPU architecture omitted]

The most likely cause is that you are running the program on a CPU that uses simultaneous multithreading (SMT), better known as hyper-threading on Intel units. To quote the wiki: for each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible. That's what's happening here.

Your OS says 8 cores, but in truth it's 4 cores with SMT. The task is clearly CPU-bound, so any increase beyond the physical number of cores does not bring any benefit, only the overhead cost of multiprocessing. That's why you see an almost linear increase in performance until you reach the (physical!) maximum number of cores (4), and then a decrease when the cores need to be shared for this very CPU-intensive task.
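
In practice that suggests sizing the Pool to physical cores rather than to cpu_count(). A minimal sketch (psutil is an assumed third-party dependency; the // 2 fallback just guesses 2-way SMT):

# Size the Pool to physical cores, not logical ones, for CPU-bound work.
from multiprocessing import Pool, cpu_count

def physical_core_count():
    try:
        import psutil                              # third-party, assumed installed
        return psutil.cpu_count(logical=False) or cpu_count()
    except ImportError:
        return max(1, cpu_count() // 2)            # rough guess: 2-way SMT

if __name__ == "__main__":
    workers = physical_core_count()
    print(f"logical CPUs: {cpu_count()}, using {workers} worker processes")
    with Pool(processes=workers) as pool:
        pass  # submit CPU-bound tasks here, e.g. pool.apply_async(cpu_heavy_fn, ())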
