
Understanding the usage of CPU cores with the multiprocessing module

I have a simple main() function that processes a huge amount of data. Since I have an 8-core machine with lots of RAM, I was advised to use Python's multiprocessing module to speed up the processing. Each subprocess will take about 18 hours to finish.

Long story short, I doubt that I have understood the behaviour of the multiprocessing module correctly.

I somehow start the different subprocesses like this:

import multiprocessing

def main():
    data = huge_amount_of_data()
    pool = multiprocessing.Pool(processes=cpu_cores)  # cpu_cores is set to 8, since my CPU has 8 cores
    pool.map(start_process, data_chunk)  # data_chunk is a subset of data

I understand that starting this script is a process of its own, namely the main process, which finishes after all the subprocesses are finished. Obviously the main process does not consume many resources, since it only prepares the data at first and spawns the subprocesses. Will it use a core of its own, too? Meaning, will I only be able to start 7 subprocesses instead of the 8 I wanted to start above?

The core question is: can I spawn 8 subprocesses and be sure that they will run correctly in parallel with each other?

By the way, the subprocesses do not interact with each other in any way, and when they finish, each generates an SQLite database file where it stores its results. So even the result storage is handled separately.

What I want to avoid is spawning a process that will hinder the others from running at full speed. I need the code to finish in the estimated 16 hours and not in double that time because I have more processes than cores. :-)

As an aside, if you create a Pool without arguments, it will deduce the number of available cores automatically, using the result of cpu_count().
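A quick way to check this on your own machine (note that _processes is a CPython-internal attribute, inspected here only to illustrate the default; don't rely on it in real code):

```python
import multiprocessing

# The number of cores multiprocessing can see:
print(multiprocessing.cpu_count())

with multiprocessing.Pool() as pool:
    # With no arguments, Pool() starts cpu_count() worker processes.
    print(pool._processes == multiprocessing.cpu_count())  # True
```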

On any modern multitasking OS, no single program can generally occupy a core and prevent other programs from running on it.

How many workers you should start depends on the characteristics of your start_process function. The number of cores isn't the only consideration.

If each worker process uses, e.g., 1/4 of the available memory, starting more than 3 will lead to lots of swapping and a general slowdown. This condition is called being "memory-bound".

If the worker processes do other things than just calculations (e.g. reading from or writing to disk), they will have to wait a lot (since a disk is much slower than RAM; this is called "IO-bound"). In that case it might be worthwhile to start more than one worker per core.

If the workers are neither memory-bound nor IO-bound, they are CPU-bound, i.e. limited by the number of cores.

The OS controls which processes get assigned to which core. Because other application processes are running as well, you cannot guarantee that all 8 cores are available to your application.

The main thread stays in its own process, but because the map() call blocks until all workers are done, that process is likely to be blocked as well, not using any CPU core.
