
Dask dataframe compute failed

I'm playing around with Python Dask. I followed their dataframe example Jupyter notebook, but it fails at the step that converts a Dask dataframe to a pandas dataframe by calling compute(). Could anyone please advise what I did wrong?

Code:

### Cell0
!pip install "dask[complete]"
!pip install pandas

### Cell1 
import dask
import dask.dataframe as dd
df = dask.datasets.timeseries()
df

### Cell2 
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()
df3

### Cell3
computed_df = df3.compute()
type(computed_df)

The error is raised when executing computed_df = df3.compute() in Cell3.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-6da1eef50c1d> in <module>
----> 1 computed_df = df3.compute()
      2 type(computed_df)

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in compute(self, **kwargs)
    283         dask.base.compute
    284         """
--> 285         (result,) = compute(self, traverse=False, **kwargs)
    286         return result
    287 

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in compute(*args, **kwargs)
    559     )
    560 
--> 561     dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
    562     keys, postcomputes = [], []
    563     for x in collections:

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in collections_to_dsk(collections, optimize_graph, optimizations, **kwargs)
    335         for opt, val in groups.items():
    336             dsk, keys = _extract_graph_and_keys(val)
--> 337             dsk = opt(dsk, keys, **kwargs)
    338 
    339             for opt in optimizations:

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/optimize.py in optimize(dsk, keys, **kwargs)
     20     else:
     21         # Perform Blockwise optimizations for HLG input
---> 22         dsk = optimize_dataframe_getitem(dsk, keys=keys)
     23         dsk = optimize_blockwise(dsk, keys=keys)
     24         dsk = fuse_roots(dsk, keys=keys)

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/optimize.py in optimize_dataframe_getitem(dsk, keys)
    103         # Project columns and update blocks
    104         old = layers[k]
--> 105         new = old.project_columns(columns)[0]
    106         if new.name != old.name:
    107             columns = list(columns)

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/layers.py in project_columns(self, columns)
    941             # Apply column projection in IO function
    942             try:
--> 943                 io_func = self.io_func.project_columns(list(columns))
    944             except AttributeError:
    945                 io_func = self.io_func

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/io/demo.py in project_columns(self, columns)
     87         func = copy.deepcopy(self)
     88         func.columns = columns
---> 89         func.dtypes = {c: self.dtypes[c] for c in columns}
     90         return func
     91 

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/io/demo.py in <dictcomp>(.0)
     87         func = copy.deepcopy(self)
     88         func.columns = columns
---> 89         func.dtypes = {c: self.dtypes[c] for c in columns}
     90         return func
     91 

KeyError: 'gt-d5f81fc97f91e68c389fc34631419acc'

Interesting, I can reproduce this bug with:

python=3.9.4
pandas=1.2.4
dask=2021.5.0
distributed=2021.5.0

Specifically, the error occurs in this step:

df2 = df[df.y > 0]
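
To illustrate why that step matters, here is a minimal sketch in plain Python (no dask needed) of the lookup that fails on line 89 of demo.py in the traceback. It assumes the demo frame has the default timeseries columns id, name, x and y, and it assumes that the 'gt-...' key is the name of the graph layer produced by df.y > 0 rather than a real column; that reading comes from the traceback and is not confirmed here. The helper project_dtypes is hypothetical and only mimics the dict comprehension shown in the traceback.

```python
# Column dtypes of the demo frame (assumed: the defaults of dask.datasets.timeseries()).
DTYPES = {"id": int, "name": str, "x": float, "y": float}


def project_dtypes(dtypes, columns):
    """Hypothetical stand-in for the failing line:
    func.dtypes = {c: self.dtypes[c] for c in columns}
    """
    return {c: dtypes[c] for c in columns}


# Real column names project without trouble.
ok = project_dtypes(DTYPES, ["x", "y"])

# A graph key that slipped into the column list is not a column, so the lookup fails.
bad_key = "gt-d5f81fc97f91e68c389fc34631419acc"
try:
    project_dtypes(DTYPES, ["x", bad_key])
    failed_on = None
except KeyError as err:
    failed_on = err.args[0]

print("projected dtypes:", ok)
print("KeyError raised for:", failed_on)
```

If that assumption holds, the filter itself is fine; the column-projection optimization appears to treat the comparison's key as a column name.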

I raised an issue on GitHub, but in the meantime downgrading dask to a 2021.4 release (2021.4.1 in the environment below) resolves the problem (the computed result will show):

python=3.9.4
pandas=1.2.4
dask=2021.4.1
distributed=2021.4.1

(Note that Python here is 3.9, which seems to match your environment.)
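
A quick way to tell whether an environment has the affected release is to check the installed version before running the notebook. The sketch below uses only the standard library; the AFFECTED_RELEASES set is an assumption that reflects only what was reproduced in this thread (2021.5.0), not a complete list, and the helper names are made up for illustration.

```python
import re
from importlib import metadata

# Assumption: 2021.5.0 is the only release reported as affected in this thread.
AFFECTED_RELEASES = {(2021, 5, 0)}


def parse_release(version):
    """Turn '2021.5.0' or '2021.05.0' into (2021, 5, 0); return None if unparseable."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_affected(version):
    """True if the version string names a release listed in AFFECTED_RELEASES."""
    return parse_release(version) in AFFECTED_RELEASES


def installed_dask_version():
    """Return the installed dask version string, or None if dask is not installed."""
    try:
        return metadata.version("dask")
    except metadata.PackageNotFoundError:
        return None


version = installed_dask_version()
if version is None:
    print("dask is not installed in this environment")
elif is_affected(version):
    print(f"dask {version} matches the release reported as affected here")
else:
    print(f"dask {version} is not the release reported as affected here")
```

If the check reports the affected release, pinning both packages in Cell0, for example !pip install "dask[complete]==2021.4.1" "distributed==2021.4.1", should reproduce the working environment listed above.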
