
Is it possible to get dask to work with python multiprocessing shared_memory (BrokenProcessPool error)?

To speed up a data-intensive computation, I would like to access a shared_memory array from within different processes created with dask delayed/compute.

The code looks as follows (input_data is the array to be shared: it contains columns of ints, floats, and datetime objects, and has overall dtype 'O'):

import numpy as np
import dask
import dask.multiprocessing
from dask import delayed
from multiprocessing import shared_memory


def main():

    # Create a shared-memory block and copy input_data into it
    shm = shared_memory.SharedMemory(create=True, size=input_data.nbytes)
    shared_array = np.ndarray(input_data.shape, dtype=input_data.dtype, buffer=shm.buf)
    shared_array[:] = input_data[:]

    dask_collect = []
    for i in data_ids:
        dask_collect.append(delayed(data_processing)(i, shm.name, input_data.shape, input_data.dtype))
    result, = dask.compute(dask_collect, scheduler='processes')


def data_processing(i, shm_name, shm_dim, shm_dtype):

    shm = shared_memory.SharedMemory(name=shm_name)
    shared_array = np.ndarray(shm_dim, dtype=shm_dtype, buffer=shm.buf)
    shared_array_subset = shared_array[shared_array[:, 0] == i]

    data_operations(shared_array_subset)


if __name__ == '__main__':
    main()

All of this works fine if I use scheduler='single-threaded' as a kwarg to dask.compute, but I get the following error with scheduler='processes':

Traceback (most recent call last):
  File "C:/my_path/my_script.py", line 274, in <module>
    main()
  File "C:/my_path/my_script.py", line 207, in main
    result, = dask.compute(dask_collect, scheduler='processes')
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\base.py", line 568, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\multiprocessing.py", line 219, in get
    result = get_async(
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\site-packages\dask\local.py", line 506, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Users\me\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Process finished with exit code 1

The error occurs before reaching the "data_operations(shared_array_subset)" part.

Am I using shared_memory or dask incorrectly?

Thanks!

NumPy releases the Python GIL during its heavy operations, so the threaded scheduler will often give you better performance than multiprocessing here. Threads also share the parent process's memory, so you don't need multiprocessing.shared_memory at all:

result, = dask.compute(dask_collect, scheduler='threads')
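
A minimal sketch of that approach (assuming input_data, data_ids, and data_operations are defined as in your snippet; note that each task returns its result so dask.compute can collect it):

import dask
from dask import delayed


def data_processing(i, data):
    # Threads share the parent process's memory, so the array can be passed in directly
    subset = data[data[:, 0] == i]
    return data_operations(subset)


def main():
    dask_collect = [delayed(data_processing)(i, input_data) for i in data_ids]
    results, = dask.compute(dask_collect, scheduler='threads')
    return results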

You can also use Dask Array (Dask's parallel and distributed implementation of NumPy) here instead of the Delayed API. It has better optimizations for NumPy.
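
A rough sketch of what that could look like (input_data and data_ids are from your snippet; the chunk size is a hypothetical example, and since your array has dtype 'O' you may need to convert its columns to native numeric/datetime dtypes first for Dask Array to optimize much):

import dask.array as da

# Wrap the existing NumPy array in a Dask Array, chunked along the row axis
darr = da.from_array(input_data, chunks=(100_000, input_data.shape[1]))

# Lazily select the rows for one id, then evaluate
subset = darr[darr[:, 0] == data_ids[0]].compute()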
