
Loop within dask.array and the GIL

Will the GIL significantly decrease the performance of the following code?

The function applied to each block uses a Python loop instead of a NumPy function. I have to use a Python loop because of an external library.

Test code:

import numpy as np
import dask.array as da
import dask.sharedict as sharedict
from itertools import product


def block_func(block):
    for i in range(len(block)):  # <--- the python loop ...
        block[i] += 1
    return block


def darr_func(x, name='test'):
    dsk = {}
    for idx in product(*map(range, x.numblocks)):
        dsk[(name,) + idx] = (block_func, (x.name,) + idx)
    dsk2 = sharedict.merge((name, dsk), x.dask)
    return da.Array(dsk2, name, x.chunks, x.dtype)


def main():
    n = 1000
    chunks = 100
    arr = np.arange(n*n).reshape(n, n)
    darr = da.from_array(arr, chunks=chunks)
    result = darr_func(darr)
    print(result.compute())


main()

If that is the case, can setting a scheduler context help? How do I set the scheduler for a single function applied to a dask array? I want to keep using the default dask scheduler for other operations on dask arrays.

From the wiki, I see ways to set the scheduler for a compute call, but not for a single function:

# As a context manager
>>> with dask.set_options(get=dask.multiprocessing.get):
...     x.sum().compute()

# Set globally
>>> dask.set_options(get=dask.multiprocessing.get)
>>> x.sum().compute()

Python for loops do not release the GIL and so are hard to parallelize with threads. In this case you have a few options:

  1. Use a project like Numba or Cython to write for-loop code that releases the GIL.
  2. Use a scheduler that splits the computation out to multiple processes. My personal recommendation is to use the dask.distributed scheduler locally, which can be done by running the following two lines:

     from dask.distributed import Client
     client = Client()
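On the narrower question of using processes for one computation while keeping the default scheduler elsewhere, here is a minimal sketch. It assumes a recent dask release, where `dask.config.set(scheduler=...)` and the `scheduler=` keyword of `compute` replaced the older `dask.set_options(get=...)` form quoted in the question. It also assumes `map_blocks` is an acceptable stand-in for the hand-built graph. The helper `build` is hypothetical and exists only for this illustration.

```python
import numpy as np
import dask
import dask.array as da


def block_func(block):
    # Copy first: a chunk produced by from_array may be a view of the
    # source array, and the loop below modifies its argument in place.
    out = block.copy()
    for i in range(len(out)):  # pure-Python loop, holds the GIL
        out[i] += 1
    return out


def build(n=1000, chunks=100):
    # Hypothetical helper: returns the source array and the lazy result.
    arr = np.arange(n * n).reshape(n, n)
    darr = da.from_array(arr, chunks=chunks)
    return arr, darr.map_blocks(block_func, dtype=darr.dtype)


if __name__ == "__main__":
    arr, result = build()

    # Option A: pick the scheduler for this one call only.
    out_a = result.compute(scheduler="processes")

    # Option B: pick it for everything computed inside the block.
    with dask.config.set(scheduler="processes"):
        out_b = result.compute()

    # Outside the block, dask arrays use their default scheduler again.
    total = result.sum().compute()

    assert (out_a == arr + 1).all() and (out_b == arr + 1).all()
```

Note that the scheduler is chosen when a graph is computed, not per function inside the graph. To run only the loop-heavy step in processes, one approach is to compute that step separately and feed its result into later operations. Process-based schedulers also have to serialize the data sent to workers, so whether this pays off is worth measuring.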

However, as always, you should profile your code and try a few things. The advice given above depends on many factors. For example, Python for loops may not be an issue if the body of the loop releases the GIL.
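One rough way to check whether a loop is GIL-bound is to time it serially and with a thread pool. The sketch below uses only the standard library. The helper names (`py_loop`, `run_serial`, `run_threaded`, `timed`) are made up for this illustration, and `py_loop` stands in for whatever the external library does per element. On a standard CPython build, similar timings for the two runs suggest the loop body is not releasing the GIL. A clear speedup with threads suggests it is.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def py_loop(values):
    # Stand-in for the per-block work: a pure-Python loop over the elements.
    out = list(values)
    for i in range(len(out)):
        out[i] += 1
    return out


def run_serial(blocks):
    # Process the blocks one after another in the calling thread.
    return [py_loop(b) for b in blocks]


def run_threaded(blocks, workers=4):
    # Process the blocks with a pool of threads; results keep input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(py_loop, blocks))


def timed(fn, *args):
    # Return the function's result together with its wall-clock duration.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    blocks = [list(range(200_000)) for _ in range(8)]
    serial, t_serial = timed(run_serial, blocks)
    threaded, t_threaded = timed(run_threaded, blocks)
    assert serial == threaded
    print(f"serial:   {t_serial:.3f} s")
    print(f"threaded: {t_threaded:.3f} s")
```

Timings vary with the machine and the Python build, so treat this as a quick diagnostic and not a benchmark.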
