Python multiprocessing with numpy/scipy is slow

Question

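I have a very processor-intensive task that takes 13-20 hours to complete, depending on the machine. It seemed like an obvious candidate for parallelization with the multiprocessing library. The problem is... the more processes I spawn, the slower the same code gets.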

Time per iteration (i.e. the time it takes to run sparse.linalg.cg):

183s 1 process

245s 2 processes

312s 3 processes

383s 4 processes

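Sure, 2 processes take a little over 30% longer per iteration, but they run 2 iterations at once, so it is still somewhat faster overall. But I would not expect the actual math operations themselves to get slower! These timers do not start until after whatever overhead multiprocessing adds.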
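
To make the trade-off concrete, the timings above can be turned into an effective time per solve and a throughput speedup. This is only arithmetic on the four reported numbers; the function names are made up for illustration.

```python
# Per-iteration wall times reported above, keyed by number of processes.
seconds_per_iteration = {1: 183, 2: 245, 3: 312, 4: 383}


def effective_seconds_per_solve(times):
    """Wall time per solve once concurrent solves are accounted for."""
    return {n: t / n for n, t in times.items()}


def speedup_over_serial(times):
    """Throughput gain relative to the single-process run."""
    base = times[1]
    return {n: base * n / t for n, t in times.items()}


effective = effective_seconds_per_solve(seconds_per_iteration)
speedup = speedup_over_serial(seconds_per_iteration)

for n in sorted(seconds_per_iteration):
    print(f"{n} process(es): {effective[n]:.1f} s per solve, {speedup[n]:.2f}x speedup")
```

With these numbers, four processes give a bit under 2x the throughput of one process, well short of the ideal 4x.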

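Here is a stripped-down version of my code. The problem line is the sparse.linalg.cg one. (I have tried things like MKL and OpenBLAS, and forcing them to run in a single thread. I also tried spawning the processes manually instead of using a pool. No luck.)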
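
The question does not say how single-threaded execution was forced. One common approach, sketched here as an assumption, is to set the thread-count environment variables before numpy or scipy is imported. Which variables are honored depends on how the numerical libraries were built, and limit_blas_threads is a hypothetical helper name.

```python
import os

# Environment variables read by common OpenMP/BLAS backends when they load.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_blas_threads(num_threads=1):
    """Ask the numerical backends to use at most num_threads threads.

    Call this before importing numpy or scipy, because the backends
    typically read these variables once, when they are first loaded.
    """
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(num_threads)


limit_blas_threads(1)
# import numpy as np  # import numpy/scipy only after this point
```

Child processes inherit the parent's environment, so setting the variables once in the parent before creating the pool should cover the workers as well.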

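from math import ceil
import multiprocessing
from multiprocessing import cpu_count

import numpy as np
from scipy import sparse
import scipy.sparse.linalg  # makes sparse.linalg.cg available
from scipy.sparse import csc_matrix


def do_the_thing_partial(iteration: int, iter_size: float, outQ : multiprocessing.Queue, L: int, D: int, qP: int, elec_ind: np.ndarray, Ic: int, ubi2: int,
                 K : csc_matrix, t: np.ndarray, dip_ind_t: np.ndarray, conds: np.ndarray, hx: float, dstr: np.ndarray):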
    range_start = ceil(iteration * iter_size)
    range_end = ceil((iteration + 1) * iter_size)

    for rr in range(range_start, range_end):
        # do some things (like generate F from rr)
        Vfull=sparse.linalg.cg(K,F,tol=1e-11,maxiter=1200)[0] #Solve the system
        # do more things
        outQ.put((rr, Vfull))


def do_the_thing(L: int, D: int, qP: int, elec_ind: np.ndarray, Ic: int, ubi2: int,
                 K : csc_matrix, t: np.ndarray, dip_ind_t: np.ndarray, conds: np.ndarray, hx: float, dstr: np.ndarray):
    num_cores = cpu_count()
    iterations_per_process = (L-1) / num_cores  # 257 / 8 ?

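    # A plain multiprocessing.Queue cannot be passed as a pool task argument
    # (pickling it raises RuntimeError), so use a manager-backed queue.
    outQ = multiprocessing.Manager().Queue()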

    pool = multiprocessing.Pool(processes=num_cores)

    [pool.apply_async(do_the_thing_partial,
                      args=(i, iterations_per_process, outQ, L, D, qP, elec_ind, Ic, ubi2, K, t, dip_ind_t, conds, hx, dstr),
                      callback=None)
     for i in range(num_cores)]

    pool.close()
    pool.join()

    # A queue is not iterable; drain it once all workers have finished.
    while not outQ.empty():
        rr, Vfull = outQ.get()
        # combine results and return here
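
As a quick sanity check on the work split, the ceil-based ranges in do_the_thing_partial can be enumerated on their own. The sketch below assumes 257 iterations over 8 cores, taken from the "257 / 8 ?" comment; chunk_bounds is a hypothetical helper that mirrors the range_start/range_end lines.

```python
from math import ceil


def chunk_bounds(iteration, iter_size):
    """Half-open index range handled by one worker, as in do_the_thing_partial."""
    return ceil(iteration * iter_size), ceil((iteration + 1) * iter_size)


total_iterations = 257  # assumed value of L - 1, from the code comment
num_cores = 8           # assumed core count, from the same comment
iter_size = total_iterations / num_cores

chunks = [chunk_bounds(i, iter_size) for i in range(num_cores)]
covered = [rr for start, end in chunks for rr in range(start, end)]

print(chunks)
```

Under these assumptions every index from 0 to 256 is handled exactly once, with the first worker getting one extra iteration, so the slowdown is unlikely to come from the split itself.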

Am I doing something wrong, or is it impossible to parallelize sparse.linalg.cg because of its own optimizations?

Thanks!

Answer 1

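Here is an example of how to get a speedup using Ray (a library for parallel and distributed Python). After running pip install ray (on Linux or macOS), you can run the code below.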

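Running the serial version of the computation below (i.e. doing scipy.sparse.linalg.cg(K, F, tol=1e-11, maxiter=100) 20 times) takes 33 seconds on my laptop. Timing the code below, which launches 20 tasks and gets the results, takes 8.7 seconds. My laptop has 4 physical cores, so this is almost a 4x speedup.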

I changed your code a lot, but I think I preserved its essence.

import numpy as np
import ray
import scipy.sparse
import scipy.sparse.linalg

# Consider passing in 'num_cpus=psutil.cpu_count(logical=True)'.
ray.init()

num_elements = 10**7
dim = 10**4

data = np.random.normal(size=num_elements)
row_indices = np.random.randint(0, dim, size=num_elements)
col_indices = np.random.randint(0, dim, size=num_elements)

K = scipy.sparse.csc_matrix((data, (row_indices, col_indices)))

@ray.remote
def solve_system(K, F):
    # Solve the system.
    return scipy.sparse.linalg.cg(K, F, tol=1e-11, maxiter=100)[0]

# Store the array in shared memory first. This is optional. That is, you could
# directly pass in K, however, this should speed it up because this way it only
# needs to serialize K once. On the other hand, if you use a different value of
# "K" for each call to "solve_system", then this doesn't help.
K_id = ray.put(K)

# Time the code below!

result_ids = []
for _ in range(20):
    F = np.random.normal(size=dim)
    result_ids.append(solve_system.remote(K_id, F))

# Run a bunch of tasks in parallel. Ray will schedule one per core.
results = ray.get(result_ids)

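The call to ray.init() starts the Ray worker processes. The calls to solve_system.remote submit the tasks to the workers. Ray schedules one task per core by default, though you can specify that a particular task requires more (or fewer) resources via @ray.remote(num_cpus=2). You can also specify GPU resources and other custom resources.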

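The call to solve_system.remote immediately returns an ID representing the eventual result of the computation, and the call to ray.get takes these IDs and retrieves the actual results (so ray.get waits until the tasks have finished executing).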
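
The submit-now, collect-later pattern described here also exists in the standard library, which may help if futures are unfamiliar. The sketch below is an analogy only, not Ray: it uses threads from concurrent.futures, so it will not speed up CPU-bound work, and solve_stub is a hypothetical stand-in for a real solve.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_stub(x):
    """Hypothetical stand-in for a long-running solve."""
    return x * x


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns a Future right away, much as .remote() returns an ID.
    futures = [executor.submit(solve_stub, x) for x in range(5)]
    # result() blocks until the task is done, much as ray.get() does.
    results = [future.result() for future in futures]

print(results)
```

Results come back in submission order because the futures are read in the order they were created, not in the order the tasks finished.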

A few notes

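On my laptop, scipy.sparse.linalg.cg seems to limit itself to a single core. If it does not on your machine, consider pinning each worker to a specific core to avoid contention between worker processes (on Linux you can do this with psutil.Process().cpu_affinity([i]), where i is the index of the core to bind to).
If the tasks take varying amounts of time, make sure that you are not just waiting for one really slow task. You can check this by running ray timeline from the command line and visualizing the result in chrome://tracing (in the Chrome web browser).
Ray uses a shared-memory object store to avoid having to serialize and deserialize the K matrix once per worker. This is an important performance optimization (though it does not matter if the tasks take a really long time). It helps primarily with objects that contain large numpy arrays. It does not help with arbitrary Python objects. This is enabled by using the Apache Arrow data layout. You can read more in this blog post.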
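
For the core-pinning note, a sketch that avoids the psutil dependency is shown below. It swaps in os.sched_setaffinity from the standard library, which is available on Linux only; pin_to_core is a hypothetical helper name, and picking the first allowed core is just for demonstration.

```python
import os


def pin_to_core(core_index):
    """Pin the calling process to one CPU core (Linux only).

    Returns the resulting affinity set, or None where the call is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # for example on macOS or Windows
    os.sched_setaffinity(0, {core_index})  # pid 0 means the calling process
    return os.sched_getaffinity(0)


# Cores this process is currently allowed to run on (empty if unsupported).
allowed = sorted(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else []
if allowed:
    print(pin_to_core(allowed[0]))
```

In a worker pool, each worker would call this once at startup with a different core index, so that no two workers compete for the same core.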

You can see more in the Ray documentation. Note that I am one of the Ray developers.

Solution 1
0 Accepted 2019-04-12 06:36:09
