Dask dataframe compute failed
I'm playing around with Python Dask. I followed their dataframe example Jupyter notebook but failed at the step that converts a Dask dataframe to a pandas object by calling the compute() function. Would anyone please advise what I did wrong?
Code:
### Cell0
!pip install "dask[complete]"
!pip install pandas
### Cell1
import dask
import dask.dataframe as dd
df = dask.datasets.timeseries()
df
### Cell2
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()
df3
### Cell3
computed_df = df3.compute()
type(computed_df)
Error raised when executing computed_df = df3.compute() in cell 3:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-6da1eef50c1d> in <module>
----> 1 computed_df = df3.compute()
2 type(computed_df)
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in compute(self, **kwargs)
283 dask.base.compute
284 """
--> 285 (result,) = compute(self, traverse=False, **kwargs)
286 return result
287
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in compute(*args, **kwargs)
559 )
560
--> 561 dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
562 keys, postcomputes = [], []
563 for x in collections:
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in collections_to_dsk(collections, optimize_graph, optimizations, **kwargs)
335 for opt, val in groups.items():
336 dsk, keys = _extract_graph_and_keys(val)
--> 337 dsk = opt(dsk, keys, **kwargs)
338
339 for opt in optimizations:
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/optimize.py in optimize(dsk, keys, **kwargs)
20 else:
21 # Perform Blockwise optimizations for HLG input
---> 22 dsk = optimize_dataframe_getitem(dsk, keys=keys)
23 dsk = optimize_blockwise(dsk, keys=keys)
24 dsk = fuse_roots(dsk, keys=keys)
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/optimize.py in optimize_dataframe_getitem(dsk, keys)
103 # Project columns and update blocks
104 old = layers[k]
--> 105 new = old.project_columns(columns)[0]
106 if new.name != old.name:
107 columns = list(columns)
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/layers.py in project_columns(self, columns)
941 # Apply column projection in IO function
942 try:
--> 943 io_func = self.io_func.project_columns(list(columns))
944 except AttributeError:
945 io_func = self.io_func
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/io/demo.py in project_columns(self, columns)
87 func = copy.deepcopy(self)
88 func.columns = columns
---> 89 func.dtypes = {c: self.dtypes[c] for c in columns}
90 return func
91
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/io/demo.py in <dictcomp>(.0)
87 func = copy.deepcopy(self)
88 func.columns = columns
---> 89 func.dtypes = {c: self.dtypes[c] for c in columns}
90 return func
91
KeyError: 'gt-d5f81fc97f91e68c389fc34631419acc'
Interesting, I can reproduce this bug with:
python=3.9.4
pandas=1.2.4
dask=2021.5.0
distributed=2021.5.0
Specifically, the error is triggered by this step (although it only surfaces later, when compute() is called):
df2 = df[df.y > 0]
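The last frame of the traceback is a dict comprehension that looks up every requested column name in a dtypes mapping, and the missing key starts with gt-, which looks like a graph key for the > comparison rather than a column name. A minimal sketch of that failure mode in plain Python follows. The names dtypes, requested, and project are hypothetical stand-ins introduced here, not dask's internals, and the idea that a graph key ends up among the requested columns is an assumption inferred from the traceback.

```python
# Hypothetical stand-in for the dtypes of the default timeseries dataset
# (columns id, name, x, y); not the real object from dask's demo.py.
dtypes = {"id": int, "name": str, "x": float, "y": float}

# Assumption: the requested "columns" contain a graph key from the
# `df.y > 0` filter alongside a real column name.
requested = ["y", "gt-d5f81fc97f91e68c389fc34631419acc"]


def project(dtypes, columns):
    """Same shape as the failing line: look up every requested name."""
    return {c: dtypes[c] for c in columns}


missing = None
try:
    project(dtypes, requested)
except KeyError as exc:
    # The first name that is not a real column ends up in the KeyError.
    missing = exc.args[0]
    print(f"KeyError: {missing!r}")
```

Projecting only real column names, for example project(dtypes, ["x", "y"]), succeeds, which is consistent with the failure depending on the filter step rather than on the groupby.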
I raised an issue on GitHub, but in the meantime downgrading dask (and distributed) to 2021.4.1 resolves the problem, and the computed result is displayed. The working environment:
python=3.9.4
pandas=1.2.4
dask=2021.4.1
distributed=2021.4.1
(Note that Python here is 3.9, which seems to match your environment.)
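To check which versions the notebook kernel actually has installed, something like the following can be run in a cell. This is a minimal sketch using the standard-library importlib.metadata module. REPORTED_FAILING, installed_version, and report are names introduced here for illustration, and only the dask release reported above as failing is flagged; other releases may or may not be affected.

```python
from importlib import metadata

# Release reported in this answer as failing. Assumption: only this exact
# version is flagged; nothing is claimed about other releases.
REPORTED_FAILING = {"dask": "2021.5.0"}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("dask", "distributed", "pandas")):
    """Map each package name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in packages}


versions = report()
for name, version in versions.items():
    print(f"{name}={version}")

if versions.get("dask") == REPORTED_FAILING["dask"]:
    print("dask matches the release reported as failing above")
```

If the flagged release is installed, pinning the version in the install cell (for example !pip install "dask[complete]==2021.4.1") and restarting the kernel should give the environment listed above.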