How do I actually get dask to compute a list of delayed or dask-container-based results?

Question

I have a trivially parallelizable task: computing results independently for many tables split across many files. I can construct lists of delayed or dask.dataframe objects (and have also tried a dict), but I cannot get all of the results to compute. I can get individual results from a dask-graph-style dictionary using .get(), but that still does not give me an easy way to compute everything. Here's a minimal example:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = dd.from_pandas(pd.DataFrame({'a': [1,2]}), npartitions=1)
>>> numbers = [df['a'].mean() for _ in range(2)]
>>> dd.compute(numbers)
([<dask.dataframe.core.Scalar at 0x7f91d1523978>,
  <dask.dataframe.core.Scalar at 0x7f91d1523a58>],)

Similarly:

>>> import dask
>>> from dask import delayed
>>> @delayed
... def mean(data):
...     return sum(data) / len(data)  # return the value; without it the task yields None
>>> delayed_numbers = [mean([1,2]) for _ in range(2)]
>>> dask.compute(delayed_numbers)
([Delayed('mean-0e0a0dea-fa92-470d-b06e-b639fbaacae3'),
  Delayed('mean-89f2e361-03b6-4279-bef7-572ceac76324')],)

I would like to get [1.5, 1.5], which is what I would expect based on the delayed collections docs.

For my real problem, I would actually like to compute on tables in an HDF5 file, but given that I can get that to work with dask.get(), I'm pretty sure I'm already specifying my delayed / dask dataframe step correctly.

I would be interested in a solution that directly produces a dictionary, but I can also just pass a list of (key, value) tuples to dict(), which is probably not a huge performance hit.

Answer 1

Compute takes many collections as separate arguments. Try splatting out your arguments as follows:

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = dd.from_pandas(pd.DataFrame({'a': [1,2]}), npartitions=1)

In [4]: numbers = [df['a'].mean() for _ in range(2)]

In [5]: dd.compute(*numbers)  # note the *
Out[5]: (1.5, 1.5)

Or, as might be more common:

In [6]: dd.compute(df.a.mean(), df.a.std())
Out[6]: (1.5, 0.707107)
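
The same splatting should carry over to the delayed example from the question, and the resulting tuple can be zipped back into a dictionary. Here is a minimal sketch, assuming the tables can be represented as a plain dict; the `tables` dict and its keys are hypothetical stand-ins for the real HDF5 tables, not something from the question.

```python
import dask
from dask import delayed


@delayed
def mean(data):
    # Return the value explicitly; a delayed function without `return` yields None.
    return sum(data) / len(data)


# Hypothetical table names mapped to toy data, standing in for the HDF5 tables.
tables = {"table_a": [1, 2], "table_b": [3, 5]}

# Build one lazy task per table; nothing is computed yet.
keys = list(tables)
lazy_means = [mean(tables[key]) for key in keys]

# Splat the list so each Delayed object is passed as its own argument.
# compute returns a tuple of results in the same order as the arguments.
values = dask.compute(*lazy_means)

# Pair each key with its result to get a dictionary.
results = dict(zip(keys, values))
print(results)
```

Because compute preserves argument order, zipping the keys with the results is enough to rebuild the mapping, so there is no need to return (key, value) tuples from each task.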

Answer 1: 4 votes, accepted, 2016-05-24 01:11:22
