Fastest way to run a single function in parallel for multiple parameters in Python

Question

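Suppose I have a single function, processing. I want to run this same function several times in parallel, once per parameter, instead of running the calls sequentially one after the other.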

import rasterio

def processing(image_location):
    # open the raster at the given path
    image = rasterio.open(image_location)
    ...
    ...
    return result

#calling function serially one after the other with different parameters and saving the results to a variable.
results1 = processing(r'/home/test/image_1.tif')
results2 = processing(r'/home/test/image_2.tif')
results3 = processing(r'/home/test/image_3.tif')

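For example, if I run processing(r'/home/test/image_1.tif'), then processing(r'/home/test/image_2.tif'), and then processing(r'/home/test/image_3.tif'), as shown in the code above, the calls run sequentially, one after the other. If one call takes 5 minutes, running all three takes 5 x 3 = 15 minutes. So I am wondering whether I can run these three calls in parallel (the problem is embarrassingly parallel), so that executing the function for all three parameters takes only about 5 minutes.

Please help me find the fastest way to do this. The script should be able to use all the resources (CPU, RAM) available by default for this task.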

Answer 1

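You can use a thread pool from the multiprocessing package to run the function in parallel and collect the return values in the results variable: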

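from multiprocessing.pool import ThreadPool

pool = ThreadPool()  # defaults to one worker thread per CPU
images = [r'/home/test/image_1.tif', r'/home/test/image_2.tif', r'/home/test/image_3.tif']
# processing is the function from the question; map() returns results in input order
results = pool.map(processing, images)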
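To make the pattern above easy to try without any image files, here is a minimal sketch. The function slow_task and the file names are hypothetical stand-ins for the asker's function and data. Keep in mind that ThreadPool uses threads, so it mainly helps when the work waits on I/O or runs in code that releases the GIL; for CPU-bound pure-Python work a process pool is usually the better fit.

```python
import time
from multiprocessing.pool import ThreadPool


def slow_task(name):
    """Hypothetical stand-in for processing(): waits briefly, then returns a label."""
    time.sleep(0.2)  # simulates time spent waiting, e.g. on disk I/O
    return f"done: {name}"


names = ["image_1.tif", "image_2.tif", "image_3.tif"]  # hypothetical inputs

start = time.perf_counter()
with ThreadPool(processes=len(names)) as pool:  # one worker thread per input
    results = pool.map(slow_task, names)  # blocks until every call has finished
elapsed = time.perf_counter() - start

print(results)  # results are in the same order as names
print(f"elapsed: {elapsed:.2f}s")  # roughly one task's duration rather than three
```

The with block also shuts the pool down once map() has returned, which the shorter version above leaves to the interpreter.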

Answer 2

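You might want to take a look at IPython Parallel. It lets you easily run functions on a load-balanced (local) cluster.

For this small example, make sure you have IPython Parallel, NumPy and Pillow installed. To run the example, you first need to start the cluster. To start a local cluster with four parallel engines, type the following in a terminal (one engine per processor core seems a reasonable choice):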

ipcluster start -n 4

Then you can run the following script, which searches a given directory for jpg images and counts the number of pixels in each image:

import ipyparallel as ipp


rc = ipp.Client()
with rc[:].sync_imports():  # import on all engines
    import numpy
    from pathlib import Path
    from PIL import Image


lview = rc.load_balanced_view()  # default load-balanced view
lview.block = True  # block until map() is finished


@lview.parallel()
def count_pixels(fn: Path):
    """Silly function to count the number of pixels in an image file"""
    im = Image.open(fn)
    xx = numpy.asarray(im)
    num_pixels = xx.shape[0] * xx.shape[1]
    return fn.stem, num_pixels


pic_dir = Path('Pictures')
fn_lst = pic_dir.glob('*.jpg')  # list all jpg-files in pic_dir

results = count_pixels.map(fn_lst)  # execute in parallel

for n_, cnt in results:
    print(f"'{n_}' has {cnt} pixels.")

Answer 3

Another way of writing this with the multiprocessing library (see @Alderven's answer for a different function).

import multiprocessing as mp

import numpy as np

def calculate(input_args):
    result = input_args * 2
    return result

# the guard is needed on platforms that spawn worker processes (e.g. Windows, macOS)
if __name__ == '__main__':
    N = mp.cpu_count()
    parallel_input = np.arange(0, 100)
    print('Amount of CPUs ', N)
    print('Amount of iterations ', len(parallel_input))

    with mp.Pool(processes=N) as p:
        results = p.map(calculate, list(parallel_input))

The results variable will contain a list of your processed data, which you can then write out.
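As one way to do that writing step, here is a small sketch that pairs each input with its result and emits CSV rows. It relies on Pool.map returning results in input order. The helper write_results and the column names are assumptions made for illustration, not part of the answer above.

```python
import csv
import multiprocessing as mp
import sys


def calculate(value):
    """Stand-in for the real per-item work (same doubling as in the answer)."""
    return value * 2


def write_results(inputs, results, handle):
    """Write one CSV row per input; assumes results are in input order."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["input", "result"])  # hypothetical column names
    writer.writerows(zip(inputs, results))


if __name__ == "__main__":
    inputs = list(range(5))
    with mp.Pool(processes=2) as pool:
        results = pool.map(calculate, inputs)  # same order as inputs
    write_results(inputs, results, sys.stdout)  # pass an open file to save to disk
```

Passing a file opened with open(path, "w", newline="") instead of sys.stdout would save the same rows to disk.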

Answer 4

I think one of the easiest ways is to use joblib:

import joblib

allJobs = []
allJobs.append(joblib.delayed(processing)(r'/home/test/image_1.tif'))
allJobs.append(joblib.delayed(processing)(r'/home/test/image_2.tif'))
allJobs.append(joblib.delayed(processing)(r'/home/test/image_3.tif'))

results = joblib.Parallel(n_jobs=joblib.cpu_count(), verbose=10)(allJobs)
