Is it possible to get dask to work with python multiprocessing shared_memory (BrokenProcessPool error)?
To speed up a data-intensive computation, I want to access a shared_memory array from the different processes created with dask delayed/compute.
The code looks like this (input_data is the array to be shared: it contains columns of ints, floats, and datetime objects, and its overall dtype is 'O'):
```python
import numpy as np
import dask
import dask.multiprocessing
from dask import delayed
from multiprocessing import shared_memory

def main():
    # Create a shared block and copy input_data into it.
    shm = shared_memory.SharedMemory(create=True, size=input_data.nbytes)
    shared_array = np.ndarray(input_data.shape, dtype=input_data.dtype, buffer=shm.buf)
    shared_array[:] = input_data[:]

    dask_collect = []
    for i in data_ids:
        dask_collect.append(delayed(data_processing)(i, shm.name, input_data.shape, input_data.dtype))
    result, = dask.compute(dask_collect, scheduler='processes')

def data_processing(i, shm_name, shm_dim, shm_dtype):
    # Attach to the existing block by name and build an ndarray view onto it.
    shm = shared_memory.SharedMemory(name=shm_name)
    shared_array = np.ndarray(shm_dim, dtype=shm_dtype, buffer=shm.buf)
    shared_array_subset = shared_array[shared_array[:, 0] == i]
    data_operations(shared_array_subset)

if __name__ == '__main__':
    main()
```
All of this works fine with scheduler='single-threaded' as the kwarg to dask.compute, but with scheduler='processes' I get the following error:
```
Traceback (most recent call last):
  File "C:/my_path/my_script.py", line 274, in <module>
    main()
  File "C:/my_path/my_script.py", line 207, in main
    result, = dask.compute(dask_collect, scheduler='processes')
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\base.py", line 568, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\multiprocessing.py", line 219, in get
    result = get_async(
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\local.py", line 506, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Process finished with exit code 1
```
The error occurs before the 'data_operations(shared_array_subset)' part is ever reached.
Am I using shared_memory or dask incorrectly?
Thanks!
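For what it's worth, the same create/attach pattern runs fine outside dask when the array has a fixed-size numeric dtype. A minimal standalone sketch with toy data (all names here are illustrative):

```python
import numpy as np
from multiprocessing import shared_memory

# Create a block sized for a fixed-width numeric array and copy into it.
data = np.arange(12, dtype=np.int64).reshape(3, 4)
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
src = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
src[:] = data[:]

# Attach by name, as a worker process would do.
shm2 = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm2.buf)
assert (view == data).all()

# Drop the ndarray views before closing, then release every handle;
# the creating side also unlinks the block.
del src, view
shm2.close()
shm.close()
shm.unlink()
```

Note that an object-dtype ('O') array, like the one in the question, stores pointers to Python objects; its raw bytes are not meaningful in another process, so sharing it this way cannot work even when the pool itself survives.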
NumPy releases the Python GIL, so multithreading can give you better performance than multiprocessing here:

result, = dask.compute(dask_collect, scheduler='threads')
You could also use Dask Array (Dask's parallel and distributed implementation of NumPy) here instead of the Delayed API; it is better optimized for NumPy workloads.
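With the threaded scheduler, workers share the parent's address space, so the array can be passed to the delayed function directly and no shared_memory plumbing is needed at all. A minimal sketch with toy data (the column selection stands in for the real data_processing):

```python
import numpy as np
import dask
from dask import delayed

input_data = np.array([[0.0, 1.5], [1.0, 2.5], [0.0, 3.5]])  # toy stand-in
data_ids = [0, 1]

@delayed
def data_processing(i, arr):
    # Threads see the same array object; nothing is copied or pickled.
    subset = arr[arr[:, 0] == i]
    return subset[:, 1].sum()

tasks = [data_processing(i, input_data) for i in data_ids]
result, = dask.compute(tasks, scheduler='threads')
print(result)  # one sum per id
```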