Python: multiprocessing, loading 8/24 cores

Question

I have a machine with 24 physical cores (at least I was told so) running Debian: Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u1 x86_64 GNU/Linux. It seems to be correct:

usr@machine:~/$ cat /proc/cpuinfo  | grep processor
processor   : 0
processor   : 1
<...>
processor   : 22
processor   : 23

I had some issues trying to load all cores with Python's multiprocessing.pool.Pool. I used Pool(processes=None); the docs say that Python uses cpu_count() if None is provided.

Alas, only 8 cores were 100% loaded; the others remained idle (I used htop to monitor CPU load). I thought that I could not cook Pools properly and tried to start 24 processes "manually":

from multiprocessing import Process

print('Starting processes...')
procs = []
for param_set in all_params:  # 24 items
    # all_params and _wrap_test are defined elsewhere in my module
    p = Process(target=_wrap_test, args=(param_set,))
    p.start()
    procs.append(p)

print('Now waiting for them.')
for p in procs:
    p.join()

I had 24 "greeting" messages from the processes I started:

Starting processes...
Executing combination: Session len: 15, delta: 10, ratio: 0.1, eps_relabel: 0.5, min_pts_lof: 5, alpha: 0.01, reduce: 500
< ... 22 more messages ... >
Executing combination: Session len: 15, delta: 10, ratio: 0.1, eps_relabel: 0.5, min_pts_lof: 7, alpha: 0.01, reduce: 2000
Now waiting for them.

But still only 8 cores were loaded:

[Image: htop screenshot]

I've read here on SO that there may be issues with numpy, OpenBLAS and multicore execution. This is how I start my code:

OPENBLAS_MAIN_FREE=1 python -m tests.my_module

And after all imports I do:

os.system("taskset -p 0xff %d" % os.getpid())

So, here is the question: what should I do to have 100% load on all cores? Is this just my poor Python usage, or does it have something to do with OS limitations on multicore machines?

UPDATED: one more interesting thing is some inconsistency within the htop output. If you look at the image above, you'll see that the table below the CPU load bars shows 30-50% load for many more than 8 cores, which is definitely different from what the load bars say. Then, top seems to agree with those bars: 8 cores 100% loaded, others idle.

UPDATED ONCE AGAIN:

I used this rather popular post on SO when I added the os.system("taskset -p 0xff %d" % os.getpid()) line after all imports. I have to admit that I didn't think too much when I did that, especially after reading this:

With this line pasted in after the module imports, my example now runs on all cores

I'm a simple man. I see "works like a charm", I copy and paste. Anyway, while playing with my code I eventually removed this line. After that, my code began executing on all 24 cores in the "manual" Process scenario. For the Pool scenario the same problem remained, whether or not the affinity trick was used.

I don't think it's a real answer, because I don't know what the issue is with Pool, but at least I managed to get all cores fully loaded. Thank you!

Answer 1

Even though you solved the issue, I'll try to explain it to clarify the ideas.

From what I have read, numpy does a lot of "magic" to improve performance. One of the magic tricks is to set the CPU affinity of the process.

CPU affinity is an optimisation of the OS scheduler. It restricts a given process to always run on the same CPU core (or the same set of cores).

This improves performance by reducing the number of times the CPU cache is invalidated and by increasing the benefit from locality of reference. For computationally heavy tasks these factors are indeed important.
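To check whether a process has been restricted in this way, one can compare the number of CPUs in the machine with the set of CPUs the process is actually allowed to use. The following is only a sketch, assuming Linux and Python 3.3 or later (where os.sched_getaffinity exists); the helper name cpu_summary is made up for illustration.

```python
import multiprocessing
import os


def cpu_summary():
    """Return (CPUs in the machine, CPUs this process is allowed to run on)."""
    total = multiprocessing.cpu_count()       # does not look at the affinity mask
    usable = sorted(os.sched_getaffinity(0))  # 0 means "the calling process"
    return total, usable


if __name__ == "__main__":
    total, usable = cpu_summary()
    print("CPUs in the machine:", total)
    print("CPUs this process may use:", usable)
```

If the second list is shorter than the first number, something has narrowed the affinity mask of the process.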

What I don't like about numpy is that it does all this implicitly, which often puzzles developers.

Your processes were not running on all the cores because numpy sets the affinity of the parent process when you import the module. Then, when you spawn new processes, the affinity is inherited, so all the processes fight for a few cores instead of efficiently using all the available ones.
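The inheritance itself can be observed without numpy. The sketch below is an illustration under assumptions, not what numpy does internally: it assumes Linux, Python 3.3 or later and the "fork" start method, and the function names are hypothetical. It pins the parent to a single CPU, starts a child, and asks the child which CPUs it may use.

```python
import multiprocessing as mp
import os


def report_affinity(queue):
    # Runs in the child: send back the CPUs this process is allowed to use.
    queue.put(sorted(os.sched_getaffinity(0)))


def child_affinity_after_pinning():
    """Pin the parent to one CPU, start a child, return (parent CPUs, child CPUs)."""
    original = os.sched_getaffinity(0)
    one_cpu = {min(original)}          # pick a CPU we are certainly allowed to use
    ctx = mp.get_context("fork")       # assumption: fork is available (Linux)
    queue = ctx.Queue()
    try:
        os.sched_setaffinity(0, one_cpu)
        child = ctx.Process(target=report_affinity, args=(queue,))
        child.start()
        seen_by_child = queue.get(timeout=30)
        child.join()
    finally:
        os.sched_setaffinity(0, original)  # restore the parent's original mask
    return sorted(one_cpu), seen_by_child


if __name__ == "__main__":
    pinned, seen_by_child = child_affinity_after_pinning()
    print("parent pinned to:", pinned)
    print("child reports:   ", seen_by_child)
```

The child reports the same single CPU as the parent, which is the effect described above: whatever mask the parent has when it starts its workers is the mask they begin with.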

The os.system("taskset -p 0xff %d" % os.getpid()) command instructs the OS to reset the affinity to the cores named in the mask. Note that 0xff covers only CPUs 0-7, so on a machine with more than 8 cores a wider mask is needed to reach them all.

If you want to see it working with Pool, you can use the following trick.

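import os
from multiprocessing import Pool, cpu_count


def set_affinity_on_worker():
    """When a new worker process is created, the affinity is set to all CPUs"""
    print("I'm the process %d, setting affinity to all CPUs." % os.getpid())
    # One bit per CPU: 0xff would cover only CPUs 0-7, so size the mask
    # from cpu_count() (this gives 0xffffff on a 24-core machine).
    mask = (1 << cpu_count()) - 1
    os.system("taskset -p 0x%x %d" % (mask, os.getpid()))


if __name__ == '__main__':
    p = Pool(initializer=set_affinity_on_worker)
    ...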
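On Python 3.3 or later under Linux, the same idea can be written without shelling out to taskset, by calling os.sched_setaffinity in the initializer. This is a hedged variant rather than a drop-in fix: the names allow_all_cpus, count_usable_cpus and usable_cpus_per_worker are invented for the example, and the "fork" start method is assumed.

```python
import multiprocessing as mp
import os


def allow_all_cpus():
    """Pool initializer: runs once in each worker and widens its affinity mask."""
    try:
        os.sched_setaffinity(0, range(os.cpu_count()))
    except OSError:
        pass  # keep the inherited mask if the kernel refuses the request


def count_usable_cpus(_):
    # Runs in a worker: how many CPUs may this worker be scheduled on?
    return len(os.sched_getaffinity(0))


def usable_cpus_per_worker(workers=2):
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux)
    with ctx.Pool(processes=workers, initializer=allow_all_cpus) as pool:
        return pool.map(count_usable_cpus, range(workers))


if __name__ == "__main__":
    print(usable_cpus_per_worker())
```

Each number in the printed list is the count of CPUs one task saw as usable; on an unrestricted 24-core machine one would expect 24 for every task.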

Answer 2

In os.system("taskset -p 0xff %d" % os.getpid()), 0xff is a hexadecimal bitmask corresponding to binary 1111 1111. Each bit in the bitmask corresponds to a CPU core. A bit value of 1 means that the process can be executed on the corresponding CPU core. Therefore, to run on 24 cores you should use a mask of 0xffffff instead of 0xff.

Correct command:

os.system("taskset -p 0xffffff %d" % os.getpid())
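The mask arithmetic can be checked with a couple of lines of Python. This is just a worked illustration of the bit counting described above; mask_for and cpus_in_mask are hypothetical helper names, not part of taskset or of any library.

```python
def mask_for(n_cpus):
    """Bitmask with the lowest n_cpus bits set, one bit per CPU."""
    return (1 << n_cpus) - 1


def cpus_in_mask(mask):
    """List the CPU indices whose bit is set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    print(hex(mask_for(8)))    # 0xff
    print(hex(mask_for(24)))   # 0xffffff
    print(cpus_in_mask(0xff))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

So 0xff names exactly CPUs 0 to 7, which matches the 8 loaded cores in the question, while 24 cores need six hex digits of f.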

Answer 1
3 2015-07-11 10:59:19

Answer 2
0 2018-05-01 16:51:56
