
Understanding Dask scheduler and client

Here's something very basic that I need help understanding:

>>> from dask.distributed import Client
>>> import dask.bag as db
>>> c = Client()
>>> dsk = {'x': (lambda x: x + 1, 1), 'y': ['x', 'x'], 'z': (lambda x: db.from_sequence(x).to_dataframe(), 'y')}
>>> c.get(dsk, 'z')
Exception Exception: Exception('Client not running.  Status: None',) in <bound method Future.__del__ of <Future: status: cancelled, key: ('take-155ca6bd6582b78bbeb95ad86fa1d081', 0)>> ignored
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/distributed/client.py", line 1764, in get
    results = self.gather(packed)
  File "/usr/local/lib/python2.7/dist-packages/distributed/client.py", line 1263, in gather
    direct=direct)
  File "/usr/local/lib/python2.7/dist-packages/distributed/client.py", line 489, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/distributed/utils.py", line 234, in sync
    six.reraise(*error[0])
  File "/usr/local/lib/python2.7/dist-packages/distributed/utils.py", line 223, in f
    result[0] = yield make_coro()
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/distributed/client.py", line 1156, in _gather
    traceback)
  File "<stdin>", line 1, in <lambda>
  File "/usr/local/lib/python2.7/dist-packages/dask/bag/core.py", line 1160, in to_dataframe
    head = self.take(1)[0]
  File "/usr/local/lib/python2.7/dist-packages/dask/bag/core.py", line 1040, in take
    return tuple(b.compute())
  File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 97, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 204, in compute
    results = get(dsk, keys, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/distributed/client.py", line 1760, in get
    resources=resources)
  File "/usr/local/lib/python2.7/dist-packages/distributed/client.py", line 1729, in _graph_to_futures
    'resources': resources})
  File "/usr/local/lib/python2.7/dist-packages/distributed/client.py", line 584, in _send_to_scheduler
    raise Exception("Client not running.  Status: %s" % self.status)
Exception: Client not running.  Status: None

However, if I remove the to_dataframe() part, it runs to completion, and I can then call compute() on the result.

From the Dask examples, it is not clear whether and how Dask data structures should be used inside task definitions. Am I not supposed to use them there? Or do I have to start the scheduler differently?


OK, with help from https://stackoverflow.com/a/44193980/5691066 I discovered that if I do:

>>> c = Client(processes=False)

Then it works the way I'd expect. I'd still appreciate an explanation, though. Is it because the task runs in a separate Python process that cannot "see" the already existing client, and so uses a new client instance that has not been started?

You do not need to construct graphs manually. You can use dask collections like bag and dataframe normally in your Python process, and they will send computations to the dask.distributed cluster on their own:

>>> from dask.distributed import Client
>>> import dask.bag as db
>>> c = Client()
>>> b = db.from_sequence([1, 2])
>>> df = b.to_dataframe()
>>> df.compute()

The dictionaries you're creating are internal data structures. You don't need to use them. Using them without deeply understanding how dask works is likely to result in errors.
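
To make it clearer what those dictionaries encode, here is a rough pure-Python sketch of how such a graph can be evaluated. It is a toy, not dask's actual scheduler: toy_get and is_task are hypothetical helpers written for this illustration, and the 'z' task uses sum in place of the to_dataframe() call from the question. A task is a tuple whose first element is callable, a list is resolved element by element, and any argument that matches a key is replaced by that key's result.

```python
def is_task(x):
    """A task is a non-empty tuple whose first element is callable."""
    return isinstance(x, tuple) and len(x) > 0 and callable(x[0])


def toy_get(dsk, key):
    """Recursively evaluate `key` in the graph `dsk`.

    Illustrative only: no caching, no parallelism, no cycle detection.
    """
    def resolve(arg):
        if isinstance(arg, list):
            # Lists are resolved element by element.
            return [resolve(a) for a in arg]
        if is_task(arg):
            # Call the function on its resolved arguments.
            func, args = arg[0], arg[1:]
            return func(*(resolve(a) for a in args))
        try:
            if arg in dsk:
                # The argument names another entry in the graph.
                return resolve(dsk[arg])
        except TypeError:
            # Unhashable literal: it cannot be a key, so use it as-is.
            pass
        return arg

    return resolve(key)


# Same shape as the graph in the question, with a plain function for 'z'.
dsk = {
    'x': (lambda x: x + 1, 1),
    'y': ['x', 'x'],
    'z': (sum, 'y'),
}

result = toy_get(dsk, 'z')
print(result)
```

Every function in such a graph is executed by whichever scheduler evaluates it. With the default Client(), that means inside a worker, which is where the lambda calling to_dataframe() in the question ended up running.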

If you need to write complex task graphs, consider using dask.delayed or some of the more advanced real-time functionality of the dask.distributed concurrent.futures interface.
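
As a minimal sketch of the dask.delayed route, the hand-written graph from the question could look roughly like this. The function names inc and total are made up for the example, and total stands in for the dataframe step. Without a Client, compute() runs on dask's default local scheduler. If a Client has been created first, it registers itself as the default scheduler and the same code runs on the cluster.

```python
import dask


@dask.delayed
def inc(x):
    # Corresponds to the 'x' task in the hand-written graph.
    return x + 1


@dask.delayed
def total(values):
    # Receives the computed values, not the delayed objects.
    return sum(values)


x = inc(1)           # nothing runs yet; this only records the task
z = total([x, x])    # lists of delayed objects are resolved automatically
result = z.compute()
print(result)
```

The graph is built for you as the delayed functions are called, so there is no need to write task tuples or keys by hand.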
