df.groupby(...).apply(...) on a dask dataframe

Question

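I am using Python dask to process a large CSV panel dataset (15+ GB), and I need to run a groupby(...).apply(...) operation that drops the last observation of each stock on each day. My dataset looks like this: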

 stock     date     time   spread  time_diff 
  VOD      01-01    9:05    0.01     0:07     
  VOD      01-01    9:12    0.03     0:52     
  VOD      01-01   10:04    0.02     0:11
  VOD      01-01   10:15    0.01     0:10     
  VOD      01-01   10:25    0.03     0:39  
  VOD      01-01   11:04    0.02    22:00 
  VOD      01-02    9:04    0.02     0:05
  ...       ...     ...     ....     ...
  BAT      01-01    13:05   0.04    10:02
  BAT      01-02    9:07    0.05     0:03
  BAT      01-02    9:10    0.06     0:04
  ...       ...     ...     ....     ...

If the dataframe were in pandas, this could be done with

df_new=df_have.groupby(['stock','date'], as_index=False).apply(lambda x: x.iloc[:-1])

This code works well on a pandas df. However, I cannot get it to run on a dask dataframe. I have tried the following.

ddf_new=ddf_have.groupby(['stock','date']).apply(lambda x: x.iloc[:-1]).compute()

or

ddf_new=ddf_have.groupby(['stock','date']).apply(lambda x: x.iloc[:-1], meta=('stock' : 'f8')).compute()

or

ddf_new=ddf_have.groupby(['stock','date']).apply(lambda x: x.iloc[:-1], meta=meta).compute()

Unfortunately, none of them worked. Can anyone help me get the right code for a dask dataframe? Thanks.

Answer 1

I think that in your particular case the problem is the meta you are passing. This should work.

import pandas as pd
import numpy as np
import dask.dataframe as dd

# Synthetic 5-minute timestamps; '5min' replaces the deprecated '5T' alias
dates = pd.date_range(start='2019-01-01',
                      end='2019-12-31',
                      freq='5min')

out = []
for stock in list("abcdefgh"):
    df = pd.DataFrame({"stock":[stock]*len(dates),
                       "date":dates,
                       "spread":np.random.randn(len(dates))})
    df["time_diff"] = df["date"].diff().shift(-1)
    df["time"] = df["date"].dt.time.astype(str)
    df["date"] = df["date"].dt.date.astype(str)
    out.append(df)
df = pd.concat(out, ignore_index=True)

del out

ddf = dd.from_pandas(df, npartitions=4)

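# meta lists every output column and its dtype, in the same order as df
out = ddf.groupby(['stock','date']).apply(lambda x: x[:-1],
                                          meta={"stock":"str",
                                                "date":"str",
                                                "spread":"f8",
                                                # diff() yields timedeltas, not strings
                                                "time_diff":"timedelta64[ns]",
                                                "time":"str"})
out = out.compute().reset_index(drop=True)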
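Typing the meta dictionary by hand is easy to get wrong. A possible alternative, sketched here under the assumption that a small pandas sample with the same columns is available, is to run the function on that sample and keep a zero-row slice of the result. The names `sample` and `drop_last` are made up for this illustration.

```python
import pandas as pd

# Hypothetical two-row sample with the same columns and dtypes as the real data
sample = pd.DataFrame({
    "stock": ["VOD", "VOD"],
    "date": ["01-01", "01-01"],
    "spread": [0.01, 0.03],
    "time_diff": pd.to_timedelta(["0:07:00", "0:52:00"]),
    "time": ["9:05", "9:12"],
})


def drop_last(group: pd.DataFrame) -> pd.DataFrame:
    """Return the group without its last row."""
    return group.iloc[:-1]


# A zero-row frame that keeps the output's column names and dtypes
meta = drop_last(sample).iloc[:0]
print(meta.dtypes)
```

An empty frame like this can be passed as `meta=meta` to the dask `apply` call in place of the dictionary.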

If you can partition the files cleanly by business day and save them with to_parquet, you could use map_partitions instead of apply for better performance.
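To make the map_partitions suggestion concrete, here is a minimal sketch of a per-partition function. It assumes that every (stock, date) group sits entirely inside one partition and that rows are already ordered by time within each group; neither is guaranteed by the original setup. The name `drop_last_per_group` is hypothetical, and the small pandas frame below only stands in for a single partition.

```python
import pandas as pd


def drop_last_per_group(partition: pd.DataFrame) -> pd.DataFrame:
    """Drop the last row of every (stock, date) group inside one partition."""
    # Count rows from the end of each group: the last row gets 0
    rows_from_end = partition.groupby(["stock", "date"]).cumcount(ascending=False)
    # Keep everything except the last row of each group
    return partition[rows_from_end > 0]


# Small pandas frame standing in for one partition (illustrative values)
partition = pd.DataFrame({
    "stock": ["VOD", "VOD", "VOD", "VOD", "BAT", "BAT"],
    "date": ["01-01", "01-01", "01-01", "01-02", "01-02", "01-02"],
    "time": ["9:05", "9:12", "10:04", "9:04", "9:07", "9:10"],
    "spread": [0.01, 0.03, 0.02, 0.02, 0.05, 0.06],
})

result = drop_last_per_group(partition)
print(result)
```

On a dask dataframe the same function would be applied as `ddf.map_partitions(drop_last_per_group)`. Note that a group with a single row disappears entirely, which matches the behaviour of `x.iloc[:-1]` in the question.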
