
Why does joblib.Parallel() take much more time than a non-paralleled computation? Shouldn't Parallel() run faster than a non-paralleled computation?

The joblib module provides a simple helper class to write parallel for loops using multiprocessing.

This code uses a list comprehension to do the job:

import time
from math import sqrt
from joblib import Parallel, delayed

start_t = time.time()
list_comprehension = [sqrt(i ** 2) for i in range(1000000)]
print('list comprehension: {}s'.format(time.time() - start_t))

This takes about 0.51s:

list comprehension: 0.5140271186828613s

This code uses the joblib.Parallel() constructor:

start_t = time.time()
list_from_parallel = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(1000000))
print('Parallel: {}s'.format(time.time() - start_t))

This takes about 31s:

Parallel: 31.3990638256073s

Why is that? Shouldn't Parallel() be faster than a non-paralleled computation?

Here is part of the cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping        : 0
microcode       : 0x1
cpu MHz         : 2200.000
cache size      : 56320 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes

Q: Shouldn't Parallel() be faster than a non-paralleled computation?

Well, that depends; it depends a lot on the circumstances (whether you use joblib.Parallel() or any other way).

No benefits ever come for free (all such promises have failed to deliver, since 1917...).

Plus, it is very easy to end up paying far more (on spawning the processes needed to start multiprocessing) than you receive back (the speedup expected over the original workflow)... so due care is a must.


The best first step:

Revisit Amdahl's law, including its revisions and the criticism concerning process-scheduling effects (the speedup achieved from reorganising process-flows and using, at least in some part, parallel process-scheduling).

The original Amdahl formulation was not explicit about the so-called add-on "costs" one has to pay for moving into parallel work-flows - costs that are not in the budget of the original, pure- [SERIAL] flow-of-work. A minimal sketch for measuring these costs on your own platform follows the list below.

1) Process-instantiation has always been expensive in Python, as the interpreter first has to replicate itself (O/S-driven RAM-allocations sized for the n_jobs (2) copies, plus O/S-driven copying of the RAM-image of the main Python session). Thread-based multiprocessing would deliver a negative speedup here, as the GIL-lock still re- [SERIAL] -ises the work-steps among all spawned threads, so you get nothing, while having paid immense add-on costs for the spawning plus for each GIL-acquire/GIL-release dancing step - an awful antipattern for compute-intensive tasks. It may help mask some cases of I/O-related latencies, but it is definitely not a choice for compute-intensive workloads.

2) Add-on costs for parameters' transfer - you have to move some data from the main process to the new ones. This takes add-on time that is not present in the original, pure- [SERIAL] workflow.

3) Add-on costs for returning results - you have to move some data from the new processes back to the originating (main) process. Again, this takes add-on time that is not present in the original, pure- [SERIAL] workflow.

4) Add-on costs for any data interchange (better to avoid any temptation to use it in parallel workflows - why? a) it blocks, and b) it is expensive, so you have to pay even more add-on costs to get any further, costs which you do not pay in the pure- [SERIAL] original workflow).
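
If you would like a rough feel for the costs 1), 2) and 3) on your own platform, here is a minimal sketch of such a measurement (an illustration only, not one of the templates referenced further below; the noop function and the payload size are arbitrary choices):

import multiprocessing as mp
import pickle
import time

def noop(x):
    return x

if __name__ == '__main__':
    # cost 1): process-instantiation - spawn a one-worker pool and
    # wait for a single trivial round-trip through it
    t0 = time.perf_counter()
    with mp.Pool(processes=1) as pool:
        pool.apply(noop, (0,))
    print('spawn + one round-trip: {:.0f} [us]'.format(
        (time.perf_counter() - t0) * 1e6))

    # costs 2) + 3): a SER/DES proxy for parameter/result transfers -
    # every object crossing a process boundary gets pickled + unpickled
    payload = list(range(1000000))
    t0 = time.perf_counter()
    restored = pickle.loads(pickle.dumps(payload))
    print('pickle round-trip     : {:.0f} [us] for {} ints'.format(
        (time.perf_counter() - t0) * 1e6, len(payload)))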


Q: Why does joblib.Parallel() take much more time than a non-paralleled computation?

Simply because you have to pay far, far more to launch the whole orchestrated circus than you will ever receive back from such a parallel work-flow organisation. There is too small an amount of work in math.sqrt( <int> ) to ever justify the relatively immense costs of spawning 2 full copies of the original Python (main) session, plus all the orchestrated dances needed to send each and every ( <int> ) from (main) to the workers and to retrieve each resulting ( <float> ) from the (joblib.Parallel() process) back to (main).

Your raw benchmarking times provide a sufficient comparison of the accumulated costs of producing the same result:

[SERIAL]-<iterator> feeding a [SERIAL]-processing storing into list[]:  0.51 [s]
[SERIAL]-<iterator> feeding [PARALLEL]-processing storing into list[]: 31.39 [s]

A raw estimate says that about 30.9 seconds were "wasted" to do the same (small) amount of work, simply by forgetting about the add-on costs one always has to pay - roughly 31 [us] of add-on overhead per delayed( sqrt ) call, against roughly 0.5 [us] of useful work per item.
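
For illustration only, here is a minimal sketch of the opposite strategy - enlarging the work-package so that the add-on costs get amortised over many items per task, instead of being paid once per sqrt() call. The sqrt_chunk helper is a hypothetical name, not part of the original code:

from math import sqrt
from joblib import Parallel, delayed

def sqrt_chunk(lo, hi):
    # one task now carries a whole slice, so the per-task
    # transfer overhead is paid once per chunk, not once per item
    return [sqrt(i ** 2) for i in range(lo, hi)]

if __name__ == '__main__':
    N, n_jobs = 1000000, 2
    step = N // n_jobs
    chunks = Parallel(n_jobs=n_jobs)(
        delayed(sqrt_chunk)(lo, min(lo + step, N))
        for lo in range(0, N, step)
    )
    list_from_parallel = [x for chunk in chunks for x in chunk]

Even so, each returned chunk still has to travel back through SER/DES, so whether this beats the plain list comprehension remains platform-dependent - benchmark it before relying on it.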


So, how do you measure How Much You Have To Pay ... before you have to pay it?

Benchmark, benchmark, benchmark the actual code ... (a prototype is enough).

If you are interested in benchmarking these costs - how long it takes, in [us] (i.e. How Much You Have To Pay before any useful work even starts), to do 1), 2) or 3) - benchmarking templates have been posted for testing and validating these principal costs on one's own platform. Only then can one decide what the minimum work-package is that can justify these unavoidable expenses and yield a "positive" speedup greater (ideally much greater, >> 1.0000) than the pure- [SERIAL] original.
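
A prototype of such a benchmark might look like the sketch below (the burn() work-function and the grid of work-sizes are arbitrary assumptions, not taken from any posted template) - it scans growing per-task workloads until Parallel() starts to pay off on your platform:

import time
from joblib import Parallel, delayed

def burn(n):
    # a tunable amount of per-task work
    s = 0.0
    for i in range(n):
        s += i ** 0.5
    return s

if __name__ == '__main__':
    n_tasks = 8
    for n in (10, 1000, 100000, 1000000):
        t0 = time.perf_counter()
        _ = [burn(n) for _ in range(n_tasks)]
        t_serial = time.perf_counter() - t0

        t0 = time.perf_counter()
        _ = Parallel(n_jobs=2)(delayed(burn)(n) for _ in range(n_tasks))
        t_parallel = time.perf_counter() - t0

        print('work-size {:>9d} | serial {:8.4f} s | parallel {:8.4f} s | ratio {:5.2f}x'
              .format(n, t_serial, t_parallel, t_serial / t_parallel))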
