How to improve efficiency on parallel loops in Python

I'm intrigued by how much less efficient parallel loops are in Python compared to parfor in Matlab. Here I present a simple root-finding problem, brute-forcing 10^6 initial guesses between a and b .

import numpy as np
from scipy.optimize import root
import matplotlib.pyplot as plt
import multiprocessing

# define the function to find the roots
func = lambda x: np.sin(3*np.pi*np.cos(np.pi*x)*np.sin(np.pi*x))

def forfunc(x0):
    # find a root starting from each initial guess and collect the solutions
    q = [root(func, xi).x for xi in x0]
    q = np.array(q).T[0]
    return q

# variables of the problem
a = -3
b = 5
n = int(1e6)
x0 = np.linspace(a,b,n) # list of initial guesses

# the single-process loop
q = forfunc(x0)

# parallel loop
nc = 4
pool = multiprocessing.Pool(processes=nc)
q = np.hstack(pool.map(forfunc,np.split(x0,nc)))
pool.close()

The single-process loop takes 1min 26s of wall time and the parallel loop takes 1min 7s. I see some improvement, since the speedup is 1.28, but the efficiency (time_loop / time_parallel / n_processes) is only 0.32 in this case.
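(Restating the arithmetic behind those numbers:)

t_serial = 86.0      # 1min 26s single-process wall time, in seconds
t_parallel = 67.0    # 1min 7s parallel wall time, in seconds
n_proc = 4

speedup = t_serial / t_parallel   # ~1.28
efficiency = speedup / n_proc     # ~0.32
print(speedup, efficiency)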

What is happening here and how can I improve this efficiency? Am I doing something wrong?

I also tried using dask.delayed in two ways:

import dask

# Every call is a delayed object
q = dask.compute(*[dask.delayed(func)(xi) for xi in x0])

# Every chunk is a delayed object
q = dask.compute(*[dask.delayed(forfunc)(x0i) for x0i in np.split(x0,nc)])

Both take more time than the single-process loop. The wall time for the first attempt is 3min, and the second attempt took 1min 27s.

What's Happening with Dask (or Spark)

From your single-process test, your loop executes one million tasks in about 90 seconds, so each task takes your CPU roughly 90 microseconds on average.

In distributed computing frameworks like Dask or Spark that provide flexibility and resiliency, tasks have a small overhead associated with them. Dask's overhead is as low as 200 microseconds per task. The Spark 3.0 documentation suggests that Spark can support tasks as short as 200 milliseconds, which perhaps means Dask actually has 1000x less overhead than Spark. It sounds like Dask is actually doing really well here!

If your tasks are faster than the per-task overhead of your framework, you'll simply see worse performance using it relative to manually distributing your work across the same number of machines/cores. In this case, you're running into that scenario.
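As a rough back-of-the-envelope check of that claim (a sketch using the numbers above, not code from the original answer):

n_tasks = 1_000_000       # one delayed call per initial guess
total_time = 90.0         # seconds for the single-process loop
per_task = total_time / n_tasks   # ~90 microseconds of actual work per task
dask_overhead = 200e-6            # ~200 microseconds of scheduling overhead per Dask task

# The overhead alone exceeds the work, so one-task-per-guess runs slower
# than the plain single-process loop.
print(per_task, dask_overhead, dask_overhead / per_task)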

In your chunked-data Dask example you have only a few tasks, so you see better performance from the reduced overhead. But you're likely either taking a small performance hit from Dask's overhead relative to raw multiprocessing, or you're not using a Dask cluster and are running the tasks in a single process.
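If you don't want to set up a cluster at all, one way to make sure the chunked dask.delayed version actually runs on separate processes is to request the local "processes" scheduler explicitly. This is a sketch rather than part of the original answer; it reuses forfunc and x0 from the question:

import dask
import numpy as np

nc = 4
chunks = np.split(x0, nc)

# Request the local multiprocessing scheduler so each chunk runs in its own
# process instead of on the default threaded scheduler.
results = dask.compute(
    *[dask.delayed(forfunc)(chunk) for chunk in chunks],
    scheduler="processes",
    num_workers=nc,
)
q = np.hstack(results)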

Multiprocessing (and Dask) Should Help

Your results with multiprocessing are generally unexpected for this kind of embarrassingly parallel problem. You may want to confirm the number of physical cores on your machine, and in particular make sure nothing else is actively using your CPU cores. Without knowing anything else, I would guess that's the culprit.
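One quick way to check the core count is sketched below; psutil is an extra dependency not used anywhere else in this post, so treat it as an optional illustration:

import multiprocessing
import psutil  # third-party package, only needed for the physical-core count

print("logical cores: ", multiprocessing.cpu_count())
print("physical cores:", psutil.cpu_count(logical=False))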

On my laptop with two physical cores, your example takes:

  • 2min 1s for the single-process loop
  • 1min 2s for two processes
  • 1min for four processes
  • 1min 5s for a chunked Dask example with nc=2, splitting the work into two chunks and running on a LocalCluster of two workers with one thread per worker. It may be worth double-checking that you're running on a cluster.

Getting a roughly 2x speedup with two processes is in line with expectations on my laptop, as is seeing minimal or no benefit from additional processes for this CPU-bound task. Dask also adds a bit of overhead relative to raw multiprocessing.

%%time
# the single-process loop
q = forfunc(x0)
CPU times: user 1min 55s, sys: 1.68 s, total: 1min 57s
Wall time: 2min 1s

%%time
# parallel loop
nc = 2
pool = multiprocessing.Pool(processes=nc)
q = np.hstack(pool.map(forfunc,np.split(x0,nc)))
pool.close()
CPU times: user 92.6 ms, sys: 70.8 ms, total: 163 ms
Wall time: 1min 2s

%%time
# parallel loop
nc = 4
pool = multiprocessing.Pool(processes=nc)
q = np.hstack(pool.map(forfunc,np.split(x0,nc)))
pool.close()
CPU times: user 118 ms, sys: 94.6 ms, total: 212 ms
Wall time: 1min

from dask.distributed import Client, LocalCluster, wait
client = Client(n_workers=2, threads_per_worker=1)

%%time
nc = 2
chunks = np.split(x0,nc)
client.scatter(chunks, broadcast=True)
q = client.compute([dask.delayed(forfunc)(x0i) for x0i in chunks])
wait(q)
/Users/nickbecker/miniconda3/envs/prophet/lib/python3.7/site-packages/distributed/worker.py:3382: UserWarning: Large object of size 4.00 MB detected in task graph: 
  (array([1.000004, 1.000012, 1.00002 , ..., 4.99998 ... 2, 5.      ]),)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s)
CPU times: user 3.67 s, sys: 324 ms, total: 4 s
Wall time: 1min 5s
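As an aside, the warning in the output above appears because the chunk arrays end up embedded in the task graph itself. One way around it, sketched here under the assumption that the same client and chunks are still in scope, is to pass the scattered futures to client.map instead of wrapping the raw arrays in delayed calls:

# Scatter the chunks once, then map forfunc over the resulting futures so
# only lightweight references travel through the scheduler.
chunk_futures = client.scatter(chunks)
result_futures = client.map(forfunc, chunk_futures)
q = np.hstack(client.gather(result_futures))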
