Dask apply with custom function
I am experimenting with Dask, but I encountered a problem while using apply after grouping.
I have a Dask DataFrame with a large number of rows. Let's consider, for example, the following:

N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
I want to bin the values of col_1, and I follow the solution from here:
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)
where
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
and this works as I expect it to.
Now I want to take the median value in each bin (taken from here):
median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()
Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions, so I guess that somehow the apply is working on each one individually.
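The 80 rows are consistent with one result per partition-group pair (8 partitions x 10 bins). As a minimal plain-pandas sketch of that behaviour (simulating the 8 partitions with np.array_split, which is an assumption for illustration, not what Dask does internally):

import numpy as np
import pandas as pd

np.random.seed(0)
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N)})
bins = np.linspace(0, 1, 11)
labels = list(range(10))
df['bin_num'] = pd.cut(df['col_1'], bins, labels=labels)

# Simulate 8 partitions: group-apply each partition independently,
# then concatenate -- one row per (partition, group) pair.
partitions = np.array_split(df, 8)
per_partition = [p.groupby('bin_num', observed=False)['col_1'].median()
                 for p in partitions]
result = pd.concat(per_partition)
print(len(result))  # 8 partitions x 10 bins = 80 rows

Because bin_num is categorical, every partition reports all 10 categories (observed=False), which is also why some of the 80 rows come back as NaN.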
However, if I want the mean and use mean,
median = ddf2.groupby('bin_num')['col_1'].mean().compute()
it works and the output has 10 rows.
The question is then: what am I doing wrong that is preventing apply from operating as mean does?
Maybe this warning is the key (Dask doc: SeriesGroupBy.apply):
Pandas' groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask's groupby-apply will apply func once to each partition-group pair, so when func is a reduction you'll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
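For reductions that decompose across partitions, the Aggregation class that the warning points to takes a per-partition chunk function and a combining agg function. A true median does not decompose this way, but as a hedged sketch of the API (assuming Dask is installed; the column names and the custom sum are made up for illustration):

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'g': [0, 0, 1, 1], 'x': [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(df, npartitions=2)

# chunk: reduce within each partition; agg: combine the per-partition results.
custom_sum = dd.Aggregation(
    name='custom_sum',
    chunk=lambda s: s.sum(),
    agg=lambda s: s.sum(),
)
out = ddf.groupby('g')['x'].agg(custom_sum).compute()
print(out)

This yields one row per group rather than one per partition-group pair, because the agg step combines the partition results.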
You are right! I was able to reproduce your problem on Dask 2.11.0. The good news is that there's a solution!
It appears that the Dask groupby problem is specifically with the category type (pandas.core.dtypes.dtypes.CategoricalDtype). By casting the category column to another column type (float, int, str), the groupby will work correctly.
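The dtype in question is easy to check with plain pandas: pd.cut with labels returns a categorical column, and an astype cast removes it (a small sketch with made-up sample values):

import numpy as np
import pandas as pd

s = pd.Series([0.05, 0.15, 0.95])
bins = np.linspace(0, 1, 11)
binned = pd.cut(s, bins, labels=list(range(10)))
print(binned.dtype)                # category
print(binned.astype('int').dtype)  # a plain integer dtype after the cast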
Here's your code that I copied:
import dask.dataframe as dd
import pandas as pd
import numpy as np
def test_f(df, col, bins, labels):
return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)
print(ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which prints out the problem you mentioned:
bin_num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
5 0.550844
6 0.651036
7 0.751220
8 NaN
9 NaN
Name: col_1, Length: 80, dtype: float64
Here's my solution:
ddf3 = ddf2.copy()
ddf3["bin_num"] = ddf3["bin_num"].astype("int")
print(ddf3.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which printed:
bin_num
9 0.951369
2 0.249150
1 0.149563
0 0.049897
3 0.347906
8 0.847819
4 0.449029
5 0.550608
6 0.652778
7 0.749922
Name: col_1, dtype: float64
@MRocklin or @TomAugspurger Would you be able to create a fix for this in a new release? I think there is sufficient reproducible code here. Thanks for all your hard work. I love Dask and use it every day ;)