How do I actually get dask to compute a list of delayed or dask-container-based results?

I have a trivially parallelizable task: computing results independently for many tables split across many files. I can construct lists of delayed or dask.dataframe objects (I have also tried, e.g., a dict), but I cannot get all of the results to compute. I can get individual results from a dask-graph-style dictionary using .get(), but again I can't easily compute all of them. Here's a minimal example:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = dd.from_pandas(pd.DataFrame({'a': [1,2]}), npartitions=1)
>>> numbers = [df['a'].mean() for _ in range(2)]
>>> dd.compute(numbers)
([<dask.dataframe.core.Scalar at 0x7f91d1523978>,
  <dask.dataframe.core.Scalar at 0x7f91d1523a58>],)

Similarly:

>>> import dask
>>> from dask import delayed
>>> @delayed
... def mean(data):
...     return sum(data) / len(data)
...
>>> delayed_numbers = [mean([1,2]) for _ in range(2)]
>>> dask.compute(delayed_numbers)
([Delayed('mean-0e0a0dea-fa92-470d-b06e-b639fbaacae3'),
  Delayed('mean-89f2e361-03b6-4279-bef7-572ceac76324')],)

I would like to get [1.5, 1.5], which is what I would expect based on the delayed collections docs.

For my real problem, I would actually like to compute on tables in an HDF5 file, but given that I can get that to work with dask.get(), I'm pretty sure I'm already specifying my delayed / dask dataframe step correctly.

I would be interested in a solution that directly results in a dictionary, but I can also just pass a list of (key, value) tuples to dict(), which is probably not a huge performance hit.

Compute takes many collections as separate arguments. Try splatting out your arguments as follows:

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = dd.from_pandas(pd.DataFrame({'a': [1,2]}), npartitions=1)

In [4]: numbers = [df['a'].mean() for _ in range(2)]

In [5]: dd.compute(*numbers)  # note the *
Out[5]: (1.5, 1.5)

Or, as might be more common:

In [6]: dd.compute(df.a.mean(), df.a.std())
Out[6]: (1.5, 0.707107)
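
Since the question also asks for a dictionary, here is a minimal sketch of the same unpacking applied to delayed objects, with the computed values zipped back onto their keys. The table names and values are made up for illustration, and the mean helper is the one from the question with a return statement; treat this as one possible approach rather than the recommended one.

```python
import dask
from dask import delayed


@delayed
def mean(data):
    # Same helper as in the question, returning the value explicitly.
    return sum(data) / len(data)


# Hypothetical inputs: a name for each table and its values.
tables = {"table_a": [1, 2], "table_b": [2, 4, 6]}

keys = list(tables)
lazy_values = [mean(tables[key]) for key in keys]

# Unpack the list so each delayed object is passed as its own argument.
values = dask.compute(*lazy_values)

# compute returns a tuple in argument order, so it lines up with keys.
results = dict(zip(keys, values))
print(results)
```

Building the dict after a single compute call keeps all of the work in one graph execution, and the zip step only touches already computed values, so it should cost very little.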
