Is it possible to get dask to work with python multiprocessing shared_memory (BrokenProcessPool error)?

To speed up a data-intensive computation, I want to access a shared_memory array from the different processes created with dask delayed/compute.

The code looks like this (input_data is the array to be shared: it holds columns of ints, floats, and datetime objects, and its overall dtype is "O"):

import numpy as np
import dask
import dask.multiprocessing
from dask import delayed
from multiprocessing import shared_memory


def main():

    # Create a shared-memory block and copy input_data into it
    shm = shared_memory.SharedMemory(create=True, size=input_data.nbytes)
    shared_array = np.ndarray(input_data.shape, dtype=input_data.dtype, buffer=shm.buf)
    shared_array[:] = input_data[:]

    dask_collect = []
    for i in data_ids:
        dask_collect.append(delayed(data_processing)(i, shm.name, input_data.shape, input_data.dtype))
    result, = dask.compute(dask_collect, scheduler='processes')


def data_processing(i, shm_name, shm_dim, shm_dtype):

    # Attach to the existing shared-memory block by name and wrap it in an array
    shm = shared_memory.SharedMemory(name=shm_name)
    shared_array = np.ndarray(shm_dim, dtype=shm_dtype, buffer=shm.buf)
    shared_array_subset = shared_array[shared_array[:, 0] == i]

    data_operations(shared_array_subset)


if __name__ == '__main__':
    main()

All of this works fine if I use scheduler='single-threaded' as a kwarg to dask.compute, but I get the following error with scheduler='processes':

Traceback (most recent call last):
  File "C:/my_path/my_script.py", line 274, in <module>
    main()
  File "C:/my_path/my_script.py", line 207, in main
    result, = dask.compute(dask_collect, scheduler='processes')
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\base.py", line 568, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\multiprocessing.py", line 219, in get
    result = get_async(
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\local.py", line 506, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Process finished with exit code 1

The error occurs before the "data_operations(shared_array_subset)" part is reached.

Am I using shared_memory or dask incorrectly?

Thanks!

NumPy releases the Python GIL, so multithreading can give you better performance than multiprocessing here:

result, = dask.compute(dask_collect, scheduler='threads')
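As a minimal sketch of that threaded approach (with hypothetical stand-ins for input_data, data_ids, data_processing, and data_operations): threads share the parent process's memory, so the array can be passed to the tasks directly and no shared_memory block is needed.

```python
import numpy as np
import dask
from dask import delayed

# Hypothetical stand-in for the question's input_data: first column is an id
input_data = np.array([[0, 1.5], [1, 2.5], [0, 3.5], [1, 4.5]])
data_ids = [0, 1]

def data_processing(i, arr):
    # Threads see the same array object, so no copy or attach step is required
    subset = arr[arr[:, 0] == i]
    return float(subset[:, 1].sum())  # stand-in for data_operations

tasks = [delayed(data_processing)(i, input_data) for i in data_ids]
result, = dask.compute(tasks, scheduler='threads')
print(result)  # → [5.0, 7.0]
```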

You could also use Dask Array (Dask's parallel and distributed implementation of NumPy) here rather than the Delayed API. It has better optimizations for NumPy.
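As an illustration (a sketch assuming a purely numeric array, unlike the question's object-dtype one), a chunked reduction with Dask Array might look like:

```python
import numpy as np
import dask.array as da

# Hypothetical numeric stand-in for the question's input_data
input_data = np.arange(400_000, dtype=np.float64).reshape(-1, 4)

# Wrap the in-memory NumPy array; Dask splits it into chunks along axis 0
x = da.from_array(input_data, chunks=(25_000, 4))

# The reduction runs chunk-by-chunk in parallel (threaded scheduler by default)
col_sums = x.sum(axis=0).compute()
```

Because the chunks are slices of the source array, the threaded scheduler operates on views of the same memory rather than copies.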

