
python multiprocessing Pool not always using all workers


The problem:
When sending 1000 tasks to apply_async, they run in parallel on all 48 CPUs, but sometimes fewer and fewer CPUs keep running, until only one CPU is left running; only when that last one finishes its task do all the CPUs start running again, each with a new task. It shouldn't need to wait for any "task batch" like this.

My (simplified) code:

from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(json2features, (j,)) for j in jsons]
feats = [t.get() for t in tasks]

jsons = [...] is a list of about 1000 JSONs, already loaded into memory and parsed into objects.
json2features(json) does some CPU-heavy work on a json and returns an array of numbers.
This function may take anywhere from 1 second to 15 minutes to run, so I sort the jsons using a heuristic, such that the longest tasks are hopefully first in the list and thus start first.
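
For illustration, a sketch of such a heuristic sort (the actual heuristic isn't shown here; estimated_cost is a hypothetical stand-in that uses serialized size as a rough proxy for runtime):

import json

def estimated_cost(j):
    # Hypothetical proxy: assume bigger JSONs take longer to process.
    return len(json.dumps(j))

jsons.sort(key=estimated_cost, reverse=True)  # longest-looking tasks first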

The json2features function also prints when a task finishes and how long it took. It all runs on an Ubuntu server with 48 cores, and like I said above, it starts out great, using all 47 cores. Then, as tasks get completed, fewer and fewer cores run, which would at first sound perfectly fine, were it not that after the last busy core finishes (when I see its print to stdout), all CPUs start running again on new tasks, meaning it wasn't really the end of the list. It may do the same thing again, and then again, until the actual end of the list.

Sometimes it can be using just one core for 5 minutes, and when that task is finally done, it starts using all cores again on new tasks. (So it's not stuck on some IPC overhead.)

There are no repeated jsons, nor any dependencies between them (it's all static, fresh-from-disk data, no shared references etc.), nor any dependency between json2features calls (no global state or anything), except that they use the same terminal for their prints.

I suspected that the problem was that a worker doesn't get released until get is called on its result, so I tried the following code:

from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(print, (i,)) for i in range(1000)]
# feats = [t.get() for t in tasks]

And it does print all 1000 numbers, even though get isn't called.

I have run out of ideas as to what the problem might be.
Is this really the normal behavior of Pool?

Thanks a lot!

The multiprocessing.Pool relies on a single os.pipe to deliver the tasks to the workers.

Usually on Unix, the default pipe size ranges from 4 to 64 KiB. If the JSONs you are delivering are large, the pipe might get clogged at any given point in time.

This means that, while one of the workers is busy reading a large JSON from the pipe, all the other workers will starve.
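
A quick way to check whether this applies here is to measure how big each task's payload is once pickled, since multiprocessing serializes arguments with pickle before writing them to the pipe (a sketch; jsons is the list from the question):

import pickle

sizes = [len(pickle.dumps(j)) for j in jsons]
# Payloads well above the pipe buffer (~64 KiB) must be written and read
# in several chunks, during which the task queue is effectively blocked.
print(max(sizes), sum(sizes) // len(sizes))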

It is generally bad practice to share large data via IPC, as it leads to poor performance. This is even underlined in the multiprocessing programming guidelines:

Avoid shared state

As far as possible one should try to avoid shifting large amounts of data between processes.

Instead of reading the JSON files in the main process, just send the workers the file names and let them open and read the content themselves. You will surely notice an improvement in performance, because you move the JSON loading phase into the concurrent domain as well.
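
A minimal sketch of that approach (json_paths and json2features_from_file are illustrative names, not from the original post):

import json
from multiprocessing import Pool

def json2features_from_file(path):
    with open(path) as f:
        j = json.load(f)      # parsing now happens inside the worker
    return json2features(j)   # the original CPU-heavy function

if __name__ == '__main__':
    pool = Pool(47)
    tasks = [pool.apply_async(json2features_from_file, (p,)) for p in json_paths]
    feats = [t.get() for t in tasks]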

Note that the same is true for the results: a single os.pipe is used to return the results to the main process as well. If one or more workers clog the results pipe, then all the processes will wait for the main one to drain it. Large results should be written to files as well. You can then leverage multithreading in the main process to quickly read back the results from the files.
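
For example, a sketch of the result side under that scheme, assuming json2features_from_file were changed to pickle its feature array to a file and return only that file's path; the main process then reads the files back with a thread pool (names are illustrative):

import pickle
from concurrent.futures import ThreadPoolExecutor

def read_result(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

result_paths = [t.get() for t in tasks]  # each worker returned a file path
with ThreadPoolExecutor() as ex:
    feats = list(ex.map(read_result, result_paths))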
