df.groupby(...).apply(...).reset_index() on a Dask DataFrame

Question

I want to use two Dask DataFrames to process large CSV files, and I need to run groupby(...).apply(...).reset_index() on one DataFrame before joining it with the other:

import pandas as pd
import dask.dataframe as dd

dfA = pd.DataFrame({'x': ["x1", "x2", "x2", "x1", "x3", "x2"],
                   'y': ["A", "B", "C", "B", "D", "E"]})
ddfA = dd.from_pandas(dfA, npartitions=2)

gA = ddfA.groupby('x').y.apply(list, meta=('y', 'str')).reset_index()

dfB = pd.DataFrame({'x': ["x1", "x2", "x3"],
                   'z': ["U", "V", "W"]})
ddfB = dd.from_pandas(dfB, npartitions=2)


gA.merge(ddfB, how='left', on='x')

Unfortunately, this raises KeyError: 'x'. Can anyone help me solve this problem?

Answer 1

It looks like using agg(list) instead of apply(list, meta=...) solves the issue:

import pandas as pd
import dask.dataframe as dd

dfA = pd.DataFrame(
    {"x": ["x1", "x2", "x2", "x1", "x3", "x2"], "y": ["A", "B", "C", "B", "D", "E"]}
)
ddfA = dd.from_pandas(dfA, npartitions=2)

gA = ddfA.groupby("x").y.agg(list).reset_index()

dfB = pd.DataFrame({"x": ["x1", "x2", "x3"], "z": ["U", "V", "W"]})
ddfB = dd.from_pandas(dfB, npartitions=2)

print(gA.merge(ddfB, on="x", how="left").compute())

    x          y  z
0  x1     [A, B]  U
1  x2  [B, C, E]  V
2  x3        [D]  W
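
A likely explanation for the original KeyError (my reading, not confirmed in the thread): meta=('y', 'str') describes a Series named y whose index has no name, so the column that reset_index() creates is expected to be called index rather than x, and the merge on x cannot find its key. The pandas-only sketch below shows that naming behavior on empty Series. The names meta_unnamed and meta_named are illustrative, and the snippet does not call Dask itself.

```python
import pandas as pd

# Roughly what meta=('y', 'str') describes: a Series named 'y'
# whose index has no name (assumption about how the meta is built).
meta_unnamed = pd.Series([], dtype="object", name="y")
cols_unnamed = list(meta_unnamed.reset_index().columns)

# The same empty Series, but with the index explicitly named 'x'.
meta_named = pd.Series(
    [], dtype="object", name="y",
    index=pd.Index([], dtype="object", name="x"),
)
cols_named = list(meta_named.reset_index().columns)

print(cols_unnamed)  # ['index', 'y']
print(cols_named)    # ['x', 'y']
```

If that is what is happening, renaming the column after reset_index() or passing a meta whose index is named x might be alternatives to agg(list), though I have not tested either against Dask here.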

If one of the DataFrames is much smaller than the other, you may want to look into a broadcast join, because it is likely to be considerably more performant.
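
To make the idea of a broadcast join concrete, here is a conceptual sketch in plain pandas. It is not Dask's implementation: the two row slices in partitions are hypothetical stand-ins for Dask partitions, and large and small are illustrative names. The point is that the small table is sent whole to every partition, so the large table never has to be shuffled by key.

```python
import pandas as pd

large = pd.DataFrame({"x": ["x1", "x2", "x2", "x1", "x3", "x2"],
                      "y": ["A", "B", "C", "B", "D", "E"]})
small = pd.DataFrame({"x": ["x1", "x2", "x3"],
                      "z": ["U", "V", "W"]})

# Hypothetical partitioning: two row slices standing in for Dask partitions.
partitions = [large.iloc[:3], large.iloc[3:]]

# "Broadcast": each partition is merged against the full small table,
# then the per-partition results are concatenated.
joined = pd.concat(
    [part.merge(small, on="x", how="left") for part in partitions],
    ignore_index=True,
)
print(joined)
```

For the real thing, check the dask.dataframe merge documentation for your Dask version to see which broadcast options it offers.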

Answer 2

I'm not sure what your desired output is, but if you change the order of the operations and merge in pandas before converting to Dask, it works:

import pandas as pd
import dask.dataframe as dd

dfA = pd.DataFrame({'x': ["x1", "x2", "x2", "x1", "x3", "x2"],
                   'y': ["A", "B", "C", "B", "D", "E"]})

dfB = pd.DataFrame({'x': ["x1", "x2", "x3"],
                   'z': ["U", "V", "W"]})


gA = dfA.merge(dfB, how='left', on='x')
gA = dd.from_pandas(gA, npartitions=2)
gA


                    x       y       z
npartitions=2
0              object  object  object
3                 ...     ...     ...
5                 ...     ...     ...

Answer 1
1 accepted 2021-10-07 12:25:51

Answer 2
0 2021-10-07 11:01:13