
Understanding the usage of cpu cores of the multiprocessing module

I have a simple main() function that processes a huge amount of data. Since I have an 8-core machine with lots of RAM, it was suggested that I use Python's multiprocessing module to accelerate the processing. Each subprocess will take about 18 hours to finish.

Long story short, I have doubts that I understood the behaviour of the multiprocessing module correctly.

I somehow start the different subprocesses like this:

def main():
    data = huge_amount_of_data()
    pool = multiprocessing.Pool(processes=cpu_cores)  # cpu_cores is set to 8, since my cpu has 8 cores.
    pool.map(start_process, data_chunk)  # data_chunk is a subset of data.

I understand that starting this script is a process of its own, namely the main process, which finishes after all the subprocesses are finished. Obviously the main process does not eat many resources, since it will only prepare the data at first and spawn the subprocesses. Will it use a core of its own, too? Meaning, will I only be able to start 7 subprocesses instead of the 8 I wanted to start above?

The core question is: can I spawn 8 subprocesses and be sure that they will work correctly in parallel with each other?

By the way, the subprocesses do not interact with each other in any way, and when they are finished, each of them generates an SQLite database file where it stores its results. So even the result storage is handled separately.
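Because every worker only ever touches its own database file, no locking between workers is needed. A minimal sketch of such a start_process (the function body, the table layout, and the doubling "work" are placeholders I made up, not from the question):

```python
import os
import sqlite3

def start_process(chunk):
    # Placeholder for the real 18-hour computation: here each item is doubled.
    # Every worker writes to its own SQLite file (named after its PID), so the
    # workers never share a file handle and need no synchronization.
    db_path = f"results_{os.getpid()}.sqlite"
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (value INTEGER)")
    conn.executemany("INSERT INTO results (value) VALUES (?)",
                     [(item * 2,) for item in chunk])
    conn.commit()
    conn.close()
    return db_path
```

The per-PID filename guarantees each pool worker gets a distinct file for the lifetime of the pool.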

What I want to avoid is spawning a process that hinders the others from running at full speed. I need the code to terminate in the approximately 16 hours, and not in double that time because I had more processes than cores. :-)

As an aside, if you create a Pool without arguments, it will deduce the number of available cores automatically, using the result of cpu_count() .
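For example (a minimal sketch; abs is just a stand-in for a real worker function):

```python
import multiprocessing

def default_pool_size():
    # Pool() with no processes argument uses this value internally.
    return multiprocessing.cpu_count()

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:  # same as Pool(processes=cpu_count())
        print(pool.map(abs, [-3, -1, 2]))  # prints [3, 1, 2]
```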

On any modern multitasking OS, no single program will generally be able to keep a core occupied and not allow other programs to run on it.

How many workers you should start depends on the characteristics of your start_process function. The number of cores isn't the only consideration.

If each worker process uses e.g. 1/4 of the available memory, starting more than 3 will lead to lots of swapping and a general slowdown. This condition is called "memory bound".

If the worker processes do other things than just calculations (e.g. read from or write to disk), they will have to wait a lot (since a disk is a lot slower than RAM; this is called "IO bound"). It might be worthwhile in that case to start more than one worker per core.

If the workers are neither memory-bound nor IO-bound, they will be bound by the number of cores (CPU bound).
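The three cases above can be condensed into a rough sizing heuristic. The factor of 2 for IO-bound work, the function name, and the "leave one share of RAM for the OS" rule are my own illustrative choices, not a canonical formula:

```python
import os

def suggested_workers(workload, mem_fraction_per_worker=0.0):
    """Rough pool-size heuristic for memory-, IO-, and CPU-bound workers."""
    cores = os.cpu_count() or 1
    if workload == "memory" and mem_fraction_per_worker > 0:
        # Memory bound: leave one worker's share of RAM for the OS and the
        # main process, so swapping is avoided.
        return max(1, min(cores, int(1 / mem_fraction_per_worker) - 1))
    if workload == "io":
        # IO bound: workers mostly wait on disk, so oversubscribe the cores.
        return cores * 2
    # CPU bound: more workers than cores only adds scheduling overhead.
    return cores

# e.g. workers each using 1/4 of RAM: at most 3 can run without swapping
print(suggested_workers("memory", 1 / 4))  # 3 on an 8-core machine
```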

The OS controls which processes get assigned to which core. Because there are other application processes running, you cannot guarantee that you have all 8 cores available for your application.

The main thread will keep its own process, but because the map() call blocks, that process is likely to be blocked as well, not using any CPU core.
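If you want the main process to stay free instead of sleeping in map(), map_async() is the non-blocking variant (a sketch; abs and the pool size of 2 are stand-ins, not from the question):

```python
import multiprocessing

def run_workers(data):
    with multiprocessing.Pool(processes=2) as pool:
        # map() would block here until every worker is done; map_async()
        # returns immediately, so the main process could do other work.
        async_result = pool.map_async(abs, data)
        # ... the main process is free here while the workers compute ...
        return async_result.get()  # block only when the results are needed

if __name__ == "__main__":
    print(run_workers([-1, -2, 3]))  # prints [1, 2, 3]
```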
