Why is multiprocessing taking more time than sequential processing?
The code below takes around 15 seconds to produce its result, but when I run the same computation sequentially it only takes around 11 seconds. What could be the reason for this?
import multiprocessing
import os
import time

def square(x):
    # print(os.getpid())
    return x * x

if __name__ == '__main__':
    start_time = time.time()
    p = multiprocessing.Pool()
    r = range(100000000)
    p1 = p.map(square, r)
    end_time = time.time()
    print('time_taken::', end_time - start_time)
Sequential code:
start_time = time.time()
d = list(map(square, range(100000000)))
end_time = time.time()
print('time_taken::', end_time - start_time)
Regarding your code example, there are two important factors which influence the runtime performance gains achievable through parallelization:
First, you have to take the administrative overhead into account: spawning new processes is rather expensive compared to simple arithmetic operations. You therefore only gain performance once the computation's complexity exceeds a certain threshold, which was not the case in your example above.
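To make that overhead tangible, here is a minimal sketch (the function and parameter names are my own, and the exact timings depend on your machine) that times the same trivial workload with and without a pool; the pool's fixed startup and teardown cost dominates:

```python
import time
from multiprocessing import Pool

def noop(x):
    # A deliberately trivial task: any parallel speedup is
    # dwarfed by the cost of creating the worker processes.
    return x

def measure(n_tasks=8, workers=4):
    """Return (pool_time, sequential_time) for the same trivial workload."""
    start = time.time()
    with Pool(workers) as p:
        p.map(noop, range(n_tasks))
    pool_time = time.time() - start

    start = time.time()
    list(map(noop, range(n_tasks)))
    seq_time = time.time() - start
    return pool_time, seq_time

if __name__ == "__main__":
    pool_time, seq_time = measure()
    print(f"pool: {pool_time:.4f}s  sequential: {seq_time:.6f}s")
```

On a typical machine the pool variant takes on the order of milliseconds while the sequential map finishes in microseconds, which is exactly the threshold effect described above.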
Secondly, you have to think of a "clever way" of splitting the computation into parts that can be executed independently. In the given code example, you can optimize the chunks you pass to the worker processes created by multiprocessing.Pool, so that each process has a self-contained package of computations to perform.
E.g., this could be accomplished with the following modifications of your code:
import math
from multiprocessing import Pool

def square(x):
    return x ** 2

def square_chunk(i, j):
    return list(map(square, range(i, j)))

def calculate_in_parallel(n, c=4):
    """Calculates a list of squares in a parallelized manner"""
    result = []
    step = math.ceil(n / c)
    with Pool(c) as p:
        partial_results = p.starmap(
            square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )
    for res in partial_results:
        result += res
    return result
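A quick way to convince yourself that the chunked parallel version computes the same result as a plain sequential map is to compare the two on a small input (the definitions from above are repeated here so the snippet runs on its own):

```python
import math
from multiprocessing import Pool

def square(x):
    return x ** 2

def square_chunk(i, j):
    return list(map(square, range(i, j)))

def calculate_in_parallel(n, c=4):
    """Calculates a list of squares in a parallelized manner"""
    result = []
    step = math.ceil(n / c)
    with Pool(c) as p:
        partial_results = p.starmap(
            square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )
    # Pool.starmap preserves the order of the input chunks,
    # so concatenating the partial results restores the full sequence.
    for res in partial_results:
        result += res
    return result

if __name__ == "__main__":
    n = 1000
    assert calculate_in_parallel(n) == [x ** 2 for x in range(n)]
    print("parallel and sequential results match")
```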
Please note that I used the operation x ** 2 (instead of the heavily optimized x * x) to increase the load and underline the resulting runtime differences.
Here, the Pool's starmap() function is used, which unpacks the arguments of the passed tuples. Using it, we can effectively pass more than one argument to the mapped function. Furthermore, we distribute the workload evenly across the available cores: on each core, the range of numbers between (i, min(i + step, n)) is calculated, where step denotes the chunk size, computed as the maximum number divided by the CPU count.
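The unpacking behavior of starmap() can be illustrated in isolation (a minimal sketch; the power function is my own example, not part of the code above):

```python
from itertools import starmap
from multiprocessing import Pool

def power(base, exp):
    # Receives the elements of each tuple as separate arguments.
    return base ** exp

if __name__ == "__main__":
    args = [(2, 3), (3, 2), (10, 0)]
    # itertools.starmap shows the unpacking semantics without any processes...
    assert list(starmap(power, args)) == [8, 9, 1]
    # ...and Pool.starmap applies the same semantics across worker processes.
    with Pool(2) as p:
        assert p.starmap(power, args) == [8, 9, 1]
    print("both starmap variants unpack the tuples identically")
```

A plain Pool.map would have passed each tuple as a single argument; starmap is what lets square_chunk receive its (i, j) bounds separately.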
By running the code with different parametrizations, one can clearly see that the performance gain grows as the maximum number (denoted n) increases. As expected, using more cores in parallel reduces the runtime as well.