Python multiprocessing is slow with numpy/scipy

Question

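I have a very processor-intensive task that takes 13-20 hours to complete, depending on the machine. It seemed like an obvious candidate for parallelization with the multiprocessing library. The problem is that the more processes I spawn, the slower the same code gets.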

Time per iteration (i.e., the time it takes to run sparse.linalg.cg):

183s, 1 process

245s, 2 processes

312s, 3 processes

383s, 4 processes

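Of course, although each iteration takes a little over 30% longer with 2 processes, it is running 2 of them at the same time, so it is still somewhat faster overall. But I would not expect the actual math operations themselves to get slower! These timers do not start until after whatever overhead multiprocessing adds.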
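
To make the trade-off concrete, the timings above can be turned into throughput figures. This is a minimal sketch that assumes each process completes one solve per listed iteration time; the helper name `speedup` is illustrative and not part of the original code.

```python
# Per-iteration wall-clock times reported above, keyed by process count.
seconds_per_iteration = {1: 183, 2: 245, 3: 312, 4: 383}


def speedup(n_procs):
    """Overall speedup versus one process, assuming n_procs solves finish
    in parallel within one (slower) iteration time."""
    effective = seconds_per_iteration[n_procs] / n_procs
    return seconds_per_iteration[1] / effective


for n in seconds_per_iteration:
    s = speedup(n)
    # Efficiency is the fraction of ideal linear scaling actually achieved.
    print(f"{n} process(es): speedup {s:.2f}x, efficiency {s / n:.0%}")
```

Under that assumption, going from 1 to 4 processes gives less than a 2x overall speedup, which is the slowdown the question is about.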

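Here is a stripped-down version of my code. The problem line is the sparse.linalg.cg one. (I have tried things like using MKL and OpenBLAS and forcing them to run in a single thread. I have also tried spawning the processes manually instead of using a pool. No luck.)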
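
For readers who want to reproduce the "single thread" setup, one common approach is to set the thread-count environment variables before numpy or scipy is imported. This is a sketch under that assumption, not the asker's actual configuration; the helper `limit_math_threads` is a hypothetical name, and which variable takes effect depends on the BLAS backend the installation uses.

```python
import os

# Environment variables read by OpenMP, MKL, and OpenBLAS respectively.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_math_threads(n_threads=1):
    """Ask the common threaded math backends to use at most n_threads."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)


limit_math_threads(1)
# Import numpy / scipy only after this point so the limits are picked up
# when the backend library is loaded.
```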

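import multiprocessing
from math import ceil
from multiprocessing import cpu_count

import numpy as np
import scipy.sparse.linalg
from scipy import sparse
from scipy.sparse import csc_matrix


def do_the_thing_partial(iteration: int, iter_size: float, outQ: multiprocessing.Queue, L: int, D: int, qP: int, elec_ind: np.ndarray, Ic: int, ubi2: int,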
                 K : csc_matrix, t: np.ndarray, dip_ind_t: np.ndarray, conds: np.ndarray, hx: float, dstr: np.ndarray):
    range_start = ceil(iteration * iter_size)
    range_end = ceil((iteration + 1) * iter_size)

    for rr in range(range_start, range_end):
        # do some things (like generate F from rr)
        Vfull=sparse.linalg.cg(K,F,tol=1e-11,maxiter=1200)[0] #Solve the system
        # do more things
        outQ.put((rr, Vfull))


def do_the_thing(L: int, D: int, qP: int, elec_ind: np.ndarray, Ic: int, ubi2: int,
                 K : csc_matrix, t: np.ndarray, dip_ind_t: np.ndarray, conds: np.ndarray, hx: float, dstr: np.ndarray):
    num_cores = cpu_count()
    iterations_per_process = (L-1) / num_cores  # 257 / 8 ?

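    # A plain multiprocessing.Queue cannot be passed to Pool workers as an
    # argument, so use a manager-backed queue instead.
    outQ = multiprocessing.Manager().Queue()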

    pool = multiprocessing.Pool(processes=num_cores)

    [pool.apply_async(do_the_thing_partial,
                      args=(i, iterations_per_process, outQ, L, D, qP, elec_ind, Ic, ubi2, K, t, dip_ind_t, conds, hx, dstr),
                      callback=None)
     for i in range(num_cores)]

    pool.close()
    pool.join()

    # A Queue is not iterable, so drain it explicitly.
    while not outQ.empty():
        res = outQ.get()
        # combine results and return here
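
The ceil-based range split used in do_the_thing_partial can be checked in isolation. The sketch below reproduces only that arithmetic for the 257-items-over-8-workers case hinted at in the comment; `chunk_bounds` is an illustrative helper, not part of the original code.

```python
from math import ceil


def chunk_bounds(total_items, num_workers):
    """Return (start, end) index pairs using the same ceil-based split."""
    size = total_items / num_workers  # a float, like iterations_per_process
    return [(ceil(i * size), ceil((i + 1) * size)) for i in range(num_workers)]


bounds = chunk_bounds(257, 8)
for start, end in bounds:
    # Each worker handles the half-open range [start, end).
    print(start, end, end - start)
```

For this case the ranges are contiguous and cover every index exactly once. With other sizes, floating-point rounding in the multiplication could in principle shift a boundary, so an integer-based split may be the safer choice.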

Am I doing something wrong, or is it impossible to parallelize sparse.linalg.cg because of its own internal optimizations?

Thanks!

Answer 1

Here is an example of how to get a speedup using Ray (a library for parallel and distributed Python). You can run the code below after doing pip install ray (on Linux or macOS).

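Running the serial version of the computation below (i.e., calling scipy.sparse.linalg.cg(K, F, tol=1e-11, maxiter=100) 20 times) takes 33 seconds on my laptop. Timing the code below, which launches 20 tasks and gets the results, takes 8.7 seconds. My laptop has 4 physical cores, so this is almost a 4x speedup.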
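
A serial baseline of the kind described above might be timed along these lines. This is only a sketch: the small tridiagonal test matrix, the sizes, and the function name `time_serial_solves` are assumptions chosen so it runs quickly, not the setup that produced the 33-second figure. The tolerance keyword is left at its default because its name differs between SciPy versions (tol in older releases, rtol in newer ones).

```python
import time

import numpy as np
import scipy.sparse
import scipy.sparse.linalg


def time_serial_solves(n_solves=20, dim=2000, maxiter=100):
    """Run cg n_solves times in one process; return (seconds, solutions)."""
    # Hypothetical symmetric positive definite test matrix (1D Laplacian).
    K = scipy.sparse.diags([-1.0, 2.0, -1.0], [-1, 0, 1],
                           shape=(dim, dim), format="csc")
    rng = np.random.default_rng(0)
    solutions = []
    start = time.perf_counter()
    for _ in range(n_solves):
        F = rng.normal(size=dim)
        # cg returns (solution, info); keep only the solution vector.
        solutions.append(scipy.sparse.linalg.cg(K, F, maxiter=maxiter)[0])
    return time.perf_counter() - start, solutions


elapsed, solutions = time_serial_solves()
print(f"{len(solutions)} serial solves took {elapsed:.3f}s")
```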

I made a lot of changes to your code, but I think I preserved its essence.

import numpy as np
import ray
import scipy.sparse
import scipy.sparse.linalg

# Consider passing in 'num_cpus=psutil.cpu_count(logical=True)'.
ray.init()

num_elements = 10**7
dim = 10**4

data = np.random.normal(size=num_elements)
row_indices = np.random.randint(0, dim, size=num_elements)
col_indices = np.random.randint(0, dim, size=num_elements)

K = scipy.sparse.csc_matrix((data, (row_indices, col_indices)))

@ray.remote
def solve_system(K, F):
    # Solve the system.
    return scipy.sparse.linalg.cg(K, F, tol=1e-11, maxiter=100)[0]

# Store the array in shared memory first. This is optional. That is, you could
# directly pass in K, however, this should speed it up because this way it only
# needs to serialize K once. On the other hand, if you use a different value of
# "K" for each call to "solve_system", then this doesn't help.
K_id = ray.put(K)

# Time the code below!

result_ids = []
for _ in range(20):
    F = np.random.normal(size=dim)
    result_ids.append(solve_system.remote(K_id, F))

# Run a bunch of tasks in parallel. Ray will schedule one per core.
results = ray.get(result_ids)

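The call to ray.init() starts the Ray worker processes. The call to solve_system.remote submits a task to the workers. By default Ray schedules one task per core, though you can specify that a particular task requires more (or fewer) resources via @ray.remote(num_cpus=2). You can also specify GPU resources and other custom resources.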

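The call to solve_system.remote immediately returns an ID representing the eventual result of the computation, and the call to ray.get takes those IDs and retrieves the actual results (so ray.get waits until the tasks have finished executing).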
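
The "submit now, fetch later" pattern is not specific to Ray. As an analogy only, the standard library's concurrent.futures behaves similarly: submit returns a future immediately and result blocks until the value is ready. The sketch below uses a thread pool to stay portable, so it illustrates the calling pattern rather than a CPU-bound speedup; `slow_square` and `run_tasks` are illustrative names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for a long-running task."""
    time.sleep(0.05)
    return x * x


def run_tasks(values, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns a Future right away, like solve_system.remote
        # returning an ID.
        futures = [executor.submit(slow_square, v) for v in values]
        # result() blocks until each task is done, like ray.get.
        return [f.result() for f in futures]


results = run_tasks(range(5))
print(results)
```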

Some notes

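On my laptop, scipy.sparse.linalg.cg seems to limit itself to a single core, but if it does not, you should consider pinning each worker to a specific core to avoid contention between worker processes (you can do this on Linux by calling psutil.Process().cpu_affinity([i]), where i is the index of the core to bind to).
If the tasks all take different amounts of time, make sure that you are not just waiting for one really slow task. You can check this by running ray timeline from the command line and visualizing the result in chrome://tracing (in the Chrome web browser).
Ray uses a shared-memory object store to avoid having to serialize and deserialize the K matrix once per worker. This is an important performance optimization (though it does not matter much if the tasks take a really long time). It helps primarily with objects that contain large numpy arrays. It does not help with arbitrary Python objects. This is enabled by using the Apache Arrow data layout. You can read more in this blog post.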
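
The core-pinning idea from the first note can also be sketched with the standard library instead of psutil. The code below uses os.sched_setaffinity, which is Linux-only, and picks the core from the set the process is already allowed to use; `pin_worker` is a hypothetical helper, and each worker process would call it once at startup.

```python
import os


def pin_worker(worker_index):
    """Pin the current process to one core; return the new affinity set,
    or None on platforms without os.sched_setaffinity."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Only choose among cores this process is currently allowed to run on.
    allowed = sorted(os.sched_getaffinity(0))
    core = allowed[worker_index % len(allowed)]
    os.sched_setaffinity(0, {core})  # 0 means "the calling process"
    return os.sched_getaffinity(0)


print(pin_worker(0))
```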

You can see more in the Ray documentation. Note that I am one of the Ray developers.

Solution 1
0 Accepted 2019-04-12 06:36:09
