
Multiple user-defined aggregation functions in a Dask dataframe

I'm processing a data set with Dask (it doesn't fit in memory), and I want to group the rows using a different aggregation function for each column, depending on the column and its type.

Dask has a set of default aggregation functions for numerical data types, but not for strings/objects. Is there a way to implement a user-defined aggregation function for strings, somewhat similar to the example below?

atts_to_group = ['A', 'B']
agg_fn = {
  'C': 'mean',             # int
  'D': 'concatenate_fn1',  # string - no default fn for strings - doesn't work
  'E': 'concatenate_fn2',  # string
}
ddf = ddf.groupby(atts_to_group).agg(agg_fn).compute().reset_index()

At this point I can read the whole data set into memory after dropping the irrelevant columns/rows, but I'd prefer to continue processing in Dask, since it performs the required operations faster.

Edit: I tried adding a custom function directly to the dictionary:

def custom_concat(df):
    ...
    return df_concatd

agg_fn = {
  'C': 'mean',  # int
  'D': custom_concat(df)
}

-------------------------------------------------------
ValueError: unknown aggregate Dask DataFrame Structure:

I realised that Dask provides an Aggregation data structure. The custom aggregation can be done as follows:

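import dask.dataframe as dd

# Concatenates the strings of each group, separated by ",".
# The first function (chunk) runs on the groups within each partition;
# the second (agg) joins the per-partition results for each group.
custom_concat_D = dd.Aggregation(
    'custom_concat',
    lambda s: s.apply(lambda x: ",".join(x.astype(str))),
    lambda s0: s0.apply(lambda x: ",".join(x)),
)
custom_concat_E = ...

atts_to_group = ['A', 'B']
agg_fn = {
  'C': 'mean',  # int
  'D': custom_concat_D,
  'E': custom_concat_E,
}
ddf = ddf.groupby(atts_to_group).agg(agg_fn).compute().reset_index()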

This can also be done with groupby(...).apply for a less verbose solution:

import pandas as pd

def agg_fn(x):
    return pd.Series(
        dict(
            C = x['C'].mean(), # int
            D = "{%s}" % ', '.join(x['D']), # string (concat strings)
            E = ...
        )
    )

ddf = ddf.groupby(atts_to_group).apply(agg_fn).compute().reset_index()
