Python multiprocessing slow with numpy/scipy

I have a very processor-intensive task that takes a ballpark of 13-20 hours to complete, depending on the machine. Seemed like an obvious choice for parallelization via the multiprocessing library. Problem is... the more processes I spawn, the slower the same code gets.

Time per iteration (i.e., the time it takes to run sparse.linalg.cg):

183s 1 process

245s 2 processes

312s 3 processes

383s 4 processes

Granted, while 2 processes take a little over 30% more time per iteration, they're doing 2 at the same time, so it's still marginally faster overall. But I would not expect the actual math operations themselves to be slower! These timers don't start until after whatever overhead multiprocessing adds.

Here's a stripped-down version of my code. The problem line is the sparse.linalg.cg one. (I've tried things like using MKL vs. OpenBLAS, and forcing them to run in a single thread. I also tried manually spawning Processes instead of using a pool. No luck.)
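
For reference, capping the BLAS backends at one thread per process is usually done with environment variables that must be set before numpy/scipy are imported; a minimal sketch of that setup:

import os

# Cap each BLAS backend at one thread per process. These must be set
# before numpy/scipy are imported to take effect reliably.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"        # MKL
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS

import numpy as np
import scipy.sparse.linalg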

import multiprocessing
from math import ceil
from multiprocessing import cpu_count

import numpy as np
import scipy.sparse.linalg
from scipy import sparse
from scipy.sparse import csc_matrix


def do_the_thing_partial(iteration: int, iter_size: float, outQ: multiprocessing.Queue, L: int, D: int, qP: int, elec_ind: np.ndarray, Ic: int, ubi2: int,
                         K: csc_matrix, t: np.ndarray, dip_ind_t: np.ndarray, conds: np.ndarray, hx: float, dstr: np.ndarray):
    range_start = ceil(iteration * iter_size)
    range_end = ceil((iteration + 1) * iter_size)

    for rr in range(range_start, range_end):
        # do some things (like generate F from rr)
        Vfull = sparse.linalg.cg(K, F, tol=1e-11, maxiter=1200)[0]  # solve the system
        # do more things
        outQ.put((rr, Vfull))


def do_the_thing(L: int, D: int, qP: int, elec_ind: np.ndarray, Ic: int, ubi2: int,
                 K: csc_matrix, t: np.ndarray, dip_ind_t: np.ndarray, conds: np.ndarray, hx: float, dstr: np.ndarray):
    num_cores = cpu_count()
    iterations_per_process = (L - 1) / num_cores  # 257 / 8 ?

    # A plain multiprocessing.Queue cannot be passed as an argument to Pool
    # workers; a Manager().Queue() can.
    outQ = multiprocessing.Manager().Queue()

    pool = multiprocessing.Pool(processes=num_cores)

    for i in range(num_cores):
        pool.apply_async(do_the_thing_partial,
                         args=(i, iterations_per_process, outQ, L, D, qP, elec_ind, Ic, ubi2, K, t, dip_ind_t, conds, hx, dstr),
                         callback=None)

    pool.close()
    pool.join()

    results = []
    while not outQ.empty():
        results.append(outQ.get())
    # combine results and return here

Am I doing something wrong, or is it impossible to parallelize sparse.linalg.cg because of its own optimizations?

Thanks!

Here's an example of how to get a speedup using Ray (a library for parallel and distributed Python). You can run the code below after doing pip install ray (on Linux or MacOS).

Running the serial version of the computation below (e.g., doing scipy.sparse.linalg.cg(K, F, tol=1e-11, maxiter=100) 20 times) takes 33 seconds on my laptop. Timing the code below for launching the 20 tasks and getting the results takes 8.7 seconds. My laptop has 4 physical cores, so this is almost a 4x speedup.

I changed your code a lot, but I think I preserved the essence of it.

import numpy as np
import ray
import scipy.sparse
import scipy.sparse.linalg

# Consider passing in 'num_cpus=psutil.cpu_count(logical=True)'.
ray.init()

num_elements = 10**7
dim = 10**4

data = np.random.normal(size=num_elements)
row_indices = np.random.randint(0, dim, size=num_elements)
col_indices = np.random.randint(0, dim, size=num_elements)

K = scipy.sparse.csc_matrix((data, (row_indices, col_indices)))

@ray.remote
def solve_system(K, F):
    # Solve the system.
    return scipy.sparse.linalg.cg(K, F, tol=1e-11, maxiter=100)[0]

# Store the array in shared memory first. This is optional. That is, you could
# directly pass in K, however, this should speed it up because this way it only
# needs to serialize K once. On the other hand, if you use a different value of
# "K" for each call to "solve_system", then this doesn't help.
K_id = ray.put(K)

# Time the code below!

result_ids = []
for _ in range(20):
    F = np.random.normal(size=dim)
    result_ids.append(solve_system.remote(K_id, F))

# Run a bunch of tasks in parallel. Ray will schedule one per core.
results = ray.get(result_ids)

The call to ray.init() starts the Ray worker processes. The call to solve_system.remote submits the tasks to the workers. Ray will schedule one per core by default, though you can specify that a particular task requires more resources (or fewer resources) via @ray.remote(num_cpus=2). You can also specify GPU resources and other custom resources.
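
For example, resource requirements go on the decorator; a small sketch (the function names here are illustrative, not part of the code above):

# Reserve two CPUs for each invocation of this task.
@ray.remote(num_cpus=2)
def heavy_solve(K, F):
    return scipy.sparse.linalg.cg(K, F, tol=1e-11, maxiter=100)[0]

# Reserve one GPU per invocation; only schedulable where a GPU is available.
@ray.remote(num_gpus=1)
def gpu_solve(K, F):
    return scipy.sparse.linalg.cg(K, F, tol=1e-11, maxiter=100)[0]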

The call to solve_system.remote immediately returns an ID representing the eventual output of the computation, and the call to ray.get takes the IDs and retrieves the actual results of the computation (so ray.get will wait until the tasks finish executing).
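
If some tasks finish much earlier than others, results can also be consumed incrementally with ray.wait instead of a single blocking ray.get; a minimal sketch:

remaining = list(result_ids)
while remaining:
    # Block until at least one more task has finished.
    ready, remaining = ray.wait(remaining, num_returns=1)
    for result in ray.get(ready):
        pass  # process each result as soon as it is available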

Some notes

  • On my laptop, scipy.sparse.linalg.cg seems to limit itself to a single core, but if it doesn't, then you should consider pinning each worker to a specific core to avoid contention between worker processes (you can do this on Linux with psutil.Process().cpu_affinity([i]), where i is the index of the core to bind to; see the sketch after this list).
  • If the tasks all take variable amounts of time, make sure that you aren't just waiting for one really slow task. You can check this by running ray timeline from the command line and visualizing the result in chrome://tracing (in the Chrome web browser).
  • Ray uses a shared memory object store to avoid having to serialize and deserialize the K matrix once per worker. This is an important performance optimization (though it doesn't matter if the tasks take a really long time). This helps primarily with objects that contain large numpy arrays. It doesn't help with arbitrary Python objects. This is enabled by using the Apache Arrow data layout. You can read more in this blog post.
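
As a rough sketch of the core-pinning idea from the first note above (Linux-only; psutil and the core_index parameter here are illustrative additions, not part of the original code):

import psutil

@ray.remote
def pinned_solve(K, F, core_index):
    # Pin this worker process to a single core for the duration of the task
    # (Linux only; core_index is a hypothetical parameter for illustration).
    psutil.Process().cpu_affinity([core_index])
    return scipy.sparse.linalg.cg(K, F, tol=1e-11, maxiter=100)[0]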

You can see more in the Ray documentation. Note that I'm one of the Ray developers.
