The joblib module provides a simple helper class to write parallel for loops using multiprocessing.
This code uses a list comprehension to do the job:
import time
from math import sqrt
from joblib import Parallel, delayed
start_t = time.time()
list_comprehension = [sqrt(i ** 2) for i in range(1000000)]
print('list comprehension: {}s'.format(time.time() - start_t))
It takes about 0.51s:
list comprehension: 0.5140271186828613s
This code uses the joblib.Parallel() constructor:
start_t = time.time()
list_from_parallel = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(1000000))
print('Parallel: {}s'.format(time.time() - start_t))
It takes about 31s:
Parallel: 31.3990638256073s
Why is that? Shouldn't Parallel() be faster than the non-parallel computation?
Here is part of the cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
microcode : 0x1
cpu MHz : 2200.000
cache size : 56320 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
Q : Shouldn't Parallel() become faster than a non-paralleled computation?
Well, that depends, and it depends a lot on the circumstances ( be it joblib.Parallel() or any other way ).
There are no benefits that would ever come for free ( All such promises failed to deliver, since 1917 ... )
Plus, it is very easy to end up paying way more ( for spawning the processes that multiprocessing needs ) than you receive back ( the speedup expected over the original workflow ) ... so due care is a must.
Revisit the revision and criticism of Amdahl's law regarding process-scheduling effects ( the speedup achieved from reorganising process-flows and using, at least in some part, parallel process-scheduling ).
The original Amdahl's formulation was not explicit about the so-called add-on "costs" one has to pay for going into parallel work-flows, which are not in the budget of the original, pure-[SERIAL] flow-of-work.
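As a rough illustration of why these add-on costs matter, the sketch below extends the classic Amdahl formula with an overhead term. The function name speedup_with_overhead and the choice to express the overhead as a fraction of the original serial runtime are assumptions made for this example only; it is a simplified model, not the exact overhead-strict reformulation.

```python
def speedup_with_overhead(p, n, overhead=0.0):
    """Overhead-aware, Amdahl-style speedup estimate (simplified model).

    p        -- fraction of the original runtime that can run in parallel
    n        -- number of worker processes
    overhead -- add-on costs (spawning, data transfers), expressed as a
                fraction of the original serial runtime (an assumption
                of this sketch)
    """
    return 1.0 / ((1.0 - p) + p / n + overhead)


# Ideal case: fully parallel work, no add-on costs.
print(speedup_with_overhead(p=1.0, n=2))                        # 2.0
# Add-on costs worth 60x the original runtime: a massive slowdown.
print(speedup_with_overhead(p=1.0, n=2, overhead=60.0) < 1.0)   # True
```

Once the overhead term exceeds what the parallel section saves, the "speedup" drops below 1.0, which is the situation described in the question.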
1) Process instantiation has always been expensive in python, as it first has to replicate as many copies as requested ( O/S-driven RAM-allocations sized for n_jobs (2) copies + O/S-driven copying of the RAM-image of the main python session ). ( Thread-based multiprocessing yields a negative speedup, as the GIL-lock still re-[SERIAL]-ises the work-steps among all spawned threads, so you gain nothing, while you have paid immense add-on costs for spawning + for each add-on GIL-acquire / GIL-release step. This is an awful anti-pattern for compute-intensive tasks: it may help mask some cases of I/O-related latencies, but it is definitely not a fit for compute-intensive workloads. )
2) Add-on costs for the transfer of parameters - you have to move some data from the main process to the new ones. This takes add-on time, and you have to pay this add-on cost, which is not present in the original, pure-[SERIAL] workflow.
3) Add-on costs for the transfer of returned results - you have to move some data from the new processes back to the originating (main) process. This takes add-on time, and you have to pay this add-on cost, which is not present in the original, pure-[SERIAL] workflow.
4) Add-on costs for any data interchange ( better to avoid any temptation to use this in parallel workflows - why? a) it blocks + b) it is expensive, and you have to pay even more add-on costs to get any further, which you do not pay in a pure-[SERIAL] original workflow ).
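To get a feeling for costs 2) and 3), the sketch below compares the per-item time of the useful work with the same work plus a pickle round trip of the argument and of the result. This is only a proxy: process-based backends serialise data in a similar way, but a real transfer also includes inter-process communication, so the true cost is higher. The helper names per_call_seconds, useful_work and transfer_proxy are made up for this example.

```python
import pickle
import time
from math import sqrt


def per_call_seconds(fn, values):
    """Return the average wall-clock time of fn(v) over all values."""
    start = time.perf_counter()
    for v in values:
        fn(v)
    return (time.perf_counter() - start) / len(values)


def useful_work(i):
    # The actual computation from the question.
    return sqrt(i ** 2)


def transfer_proxy(i):
    # Serialise the argument and the result, then restore both: a stand-in
    # for shipping an <int> to a worker and a <float> back to main.
    arg = pickle.loads(pickle.dumps(i))
    return pickle.loads(pickle.dumps(sqrt(arg ** 2)))


values = range(100_000)
work = per_call_seconds(useful_work, values)
moved = per_call_seconds(transfer_proxy, values)
print('useful work : {:.3f} us per item'.format(work * 1e6))
print('with pickle : {:.3f} us per item'.format(moved * 1e6))
```

The absolute numbers depend on the platform, so run it on the machine in question rather than trusting anyone else's figures.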
Q : Why does joblib.Parallel() take much more time than non-paralleled computation?
Simply because you have to pay way, way more to launch the whole orchestrated circus than you will receive back from such a parallel work-flow organisation ( there is too small an amount of work in math.sqrt( <int> ) to ever justify the relatively immense costs of spawning 2 full copies of the original python-(main)-session + all the orchestration of dances to send each and every ( <int> )-from-(main)-there and to retrieve each resulting ( <float> )-from-(joblib.Parallel()-process)-back-to-(main) ).
Your raw benchmarking times provide a sufficient comparison of the accumulated costs of producing the same result:
[SERIAL]-<iterator> feeding a [SERIAL]-processing storing into list[]: 0.51 [s]
[SERIAL]-<iterator> feeding [PARALLEL]-processing storing into list[]: 31.39 [s]
A raw estimate says about 30.9 seconds were "wasted" doing the same (small) amount of work, just by forgetting about the add-on costs one always has to pay.
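One common way to shrink the per-item share of these costs is to hand each worker one large chunk instead of a million tiny tasks. The sketch below is an illustration under that assumption, not a guaranteed fix: split_range, sqrt_chunk and parallel_sqrt are hypothetical helper names, and the result lists still have to travel back to the main process, so the outcome has to be benchmarked against the plain list comprehension.

```python
from math import sqrt


def split_range(n, n_chunks):
    """Split range(n) into n_chunks contiguous, nearly equal sub-ranges."""
    step, extra = divmod(n, n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        stop = start + step + (1 if k < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


def sqrt_chunk(chunk):
    """Process a whole chunk inside one task, returning one list."""
    return [sqrt(i ** 2) for i in chunk]


def parallel_sqrt(n, n_jobs=2):
    # Imported here so the helpers above stay usable without joblib.
    from joblib import Parallel, delayed
    parts = Parallel(n_jobs=n_jobs)(
        delayed(sqrt_chunk)(chunk) for chunk in split_range(n, n_jobs)
    )
    # Flatten the per-worker lists back into one list, in order.
    return [x for part in parts for x in part]


# Usage: list_from_parallel = parallel_sqrt(1000000, n_jobs=2)
```

With n_jobs=2 only two tasks and two result lists cross the process boundary, instead of one <int> and one <float> per item.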
Benchmark, benchmark, benchmark the actual code ... (prototype)
If you are interested in benchmarking these costs - how long it takes in [us] ( i.e. How Much You Have To Pay, before any useful work even starts ) to do 1), 2) or 3) - benchmarking templates have been posted to test and validate these principal costs on one's own platform, before deciding what the minimum work-package is that can justify these unavoidable expenses and yield a "positive" speedup greater ( best a lot greater ) than 1.0000 when compared to the pure-[SERIAL] original.
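The following is not one of the templates referred to above, only a minimal stand-in for cost 1): it times how long it takes to start and stop an idle python interpreter, in [us]. Launching a fresh interpreter is used here as a rough proxy for creating a worker process; the actual cost depends on the backend and on the O/S start method, and the function name interpreter_launch_us is an assumption of this sketch.

```python
import statistics
import subprocess
import sys
import time


def interpreter_launch_us(repeats=5):
    """Median time, in microseconds, to start and stop an idle interpreter."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Launch a child python process that does nothing and exits.
        subprocess.run([sys.executable, '-c', 'pass'], check=True)
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


print('launching one python process: {:.0f} us'.format(interpreter_launch_us()))
```

Dividing this figure by the per-item time of the useful work gives a first idea of how many items one process launch "costs", and therefore of how large a work-package must be before parallelism can pay off.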