
Pandas DataFrame aggregate function using multiple columns

Is there a way to write an aggregation function, as used in the DataFrame.agg method, that has access to more than one column of the data being aggregated? Typical use cases would be weighted average and weighted standard deviation functions.

I would like to be able to write something like:

def wAvg(c, w):
    return ((c * w).sum() / w.sum())

df = DataFrame(....) # df has columns c and w, i want weighted average
                     # of c using w as weight.
df.aggregate({"c": wAvg}) # and somehow tell it to use w column as weights ...
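For concreteness, the helper itself is easy to check outside of agg by passing the columns in explicitly; a minimal standalone sketch with made-up values:

```python
import pandas as pd

def wAvg(c, w):
    # weighted average of c using w as weights
    return (c * w).sum() / w.sum()

df = pd.DataFrame({"c": [1.0, 3.0], "w": [1.0, 3.0]})
result = wAvg(df["c"], df["w"])  # (1*1 + 3*3) / (1 + 3) = 2.5
```

The open question is how to get agg (or a groupby) to route both columns into the function, which the answers below address.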

Yes; use the .apply(...) function, which will be called on each sub-DataFrame. For example:

grouped = df.groupby(keys)

def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

grouped.apply(wavg)
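Run against a small frame (the key, data, and weights column names and values here are made up for the demo), this yields one weighted average per group:

```python
import pandas as pd

df = pd.DataFrame({
    "key":     ["a", "a", "b", "b"],
    "data":    [1.0, 3.0, 2.0, 4.0],
    "weights": [1.0, 3.0, 1.0, 1.0],
})

def wavg(group):
    # each call receives the sub-DataFrame for one group
    d = group["data"]
    w = group["weights"]
    return (d * w).sum() / w.sum()

out = df.groupby("key").apply(wavg)
# a: (1*1 + 3*3) / 4 = 2.5;  b: (2 + 4) / 2 = 3.0
```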

It is possible to return any number of aggregated values from a groupby object with apply. Simply return a Series, and the index values will become the new column names.

Let's see a quick example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'group':['a','a','b','b'],
                   'd1':[5,10,100,30],
                   'd2':[7,1,3,20],
                   'weights':[.2,.8, .4, .6]},
                  columns=['group', 'd1', 'd2', 'weights'])
df

  group   d1  d2  weights
0     a    5   7      0.2
1     a   10   1      0.8
2     b  100   3      0.4
3     b   30  20      0.6

Define a custom function that will be passed to apply. It implicitly accepts a DataFrame, meaning the data parameter is a DataFrame. Notice how it uses multiple columns, which is not possible with the agg groupby method:

def weighted_average(data):
    d = {}
    d['d1_wa'] = np.average(data['d1'], weights=data['weights'])
    d['d2_wa'] = np.average(data['d2'], weights=data['weights'])
    return pd.Series(d)

Call the groupby apply method with our custom function:

df.groupby('group').apply(weighted_average)

       d1_wa  d2_wa
group              
a        9.0    2.2
b       58.0   13.2

You can get better performance by precalculating the weighted totals into new DataFrame columns, as explained in other answers, and avoiding apply altogether.
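A sketch of that precalculation on the example frame above (only d1 shown): multiply data by weights once, then use plain groupby sums, which are vectorized.

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'd1': [5, 10, 100, 30],
                   'weights': [.2, .8, .4, .6]})

# precompute data * weight as a column, then only cheap sums remain
df['d1_w'] = df['d1'] * df['weights']
g = df.groupby('group')
d1_wa = g['d1_w'].sum() / g['weights'].sum()
# matches the apply-based result: a -> 9.0, b -> 58.0
```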

My solution is similar to Nathaniel's, only it's for a single column and it doesn't deep-copy the entire data frame each time, which could be prohibitively slow. The performance gain over the groupby(...).apply(...) solution is about 100x(!)

def weighted_average(df, data_col, weight_col, by_col):
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
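A quick check on the example frame from earlier; the helper is repeated here so the snippet runs on its own:

```python
import pandas as pd

def weighted_average(df, data_col, weight_col, by_col):
    # temporary columns let groupby do all the work with vectorized sums
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'd1': [5.0, 10.0, 100.0, 30.0],
                   'weights': [0.2, 0.8, 0.4, 0.6]})
out = weighted_average(df, 'd1', 'weights', 'group')
# the temporary columns are removed again before returning
```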

The following (based on Wes McKinney's answer) accomplishes exactly what I was looking for. I'd be happy to learn if there's a simpler way of doing this within pandas.

def wavg_func(datacol, weightscol):
    def wavg(group):
        dd = group[datacol]
        ww = group[weightscol] * 1.0
        return (dd * ww).sum() / ww.sum()
    return wavg


def df_wavg(df, groupbycol, weightscol):
    grouped = df.groupby(groupbycol)
    df_ret = grouped.agg({weightscol:sum})
    datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
    for dcol in datacols:
        try:
            wavg_f = wavg_func(dcol, weightscol)
            df_ret[dcol] = grouped.apply(wavg_f)
        except TypeError:  # handle non-numeric columns
            df_ret[dcol] = grouped.agg({dcol:min})
    return df_ret

The function df_wavg() returns a dataframe grouped by the "groupby" column, with the sum of the weights in the weights column. The other columns are the weighted averages, or, if non-numeric, aggregated with min().
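End to end on a small frame (the definitions are repeated so this runs standalone; string aggregation names are used in place of the bare builtins, which pandas treats the same way):

```python
import pandas as pd

def wavg_func(datacol, weightscol):
    def wavg(group):
        dd = group[datacol]
        ww = group[weightscol] * 1.0
        return (dd * ww).sum() / ww.sum()
    return wavg

def df_wavg(df, groupbycol, weightscol):
    grouped = df.groupby(groupbycol)
    df_ret = grouped.agg({weightscol: "sum"})
    datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
    for dcol in datacols:
        try:
            wavg_f = wavg_func(dcol, weightscol)
            df_ret[dcol] = grouped.apply(wavg_f)
        except TypeError:  # non-numeric columns fall back to min()
            df_ret[dcol] = grouped.agg({dcol: "min"})
    return df_ret

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'd1': [5.0, 10.0, 100.0, 30.0],
                   'weights': [0.2, 0.8, 0.4, 0.6]})
out = df_wavg(df, 'group', 'weights')
# out has the summed weights plus the weighted average of d1 per group
```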

I do this a lot and found the following quite handy:

def weighed_average(grp):
    num = grp._get_numeric_data()
    return num.multiply(grp['COUNT'], axis=0).sum() / grp['COUNT'].sum()

df.groupby('SOME_COL').apply(weighed_average)

This computes the weighted average of all the numeric columns in the df and drops the non-numeric ones.
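Note that _get_numeric_data is a pandas internal; on current versions, select_dtypes(include='number') is the public way to get the same effect. A standalone sketch, with the SOME_COL/val/COUNT names and values made up for the demo:

```python
import pandas as pd

def weighed_average(grp):
    # select_dtypes is the public equivalent of _get_numeric_data
    num = grp.select_dtypes(include='number')
    return num.multiply(grp['COUNT'], axis=0).sum() / grp['COUNT'].sum()

df = pd.DataFrame({'SOME_COL': ['x', 'x', 'y'],
                   'val': [1.0, 3.0, 5.0],
                   'COUNT': [1, 3, 2]})
out = df.groupby('SOME_COL').apply(weighed_average)
# x: (1*1 + 3*3) / 4 = 2.5;  y: 5.0
```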

Accomplishing this via groupby(...).apply(...) is non-performant. Here's a solution that I use all the time (essentially using kalu's logic).

def grouped_weighted_average(self, values, weights, *groupby_args, **groupby_kwargs):
    """
    :param values: column(s) to take the average of
    :param weights: column to weight on
    :param groupby_args: args to pass into groupby (e.g. the level you want to group on)
    :param groupby_kwargs: kwargs to pass into groupby
    :return: pandas.Series or pandas.DataFrame
    """

    if isinstance(values, str):
        values = [values]

    ss = []
    for value_col in values:
        df = self.copy()
        prod_name = 'prod_{v}_{w}'.format(v=value_col, w=weights)
        weights_name = 'weights_{w}'.format(w=weights)

        df[prod_name] = df[value_col] * df[weights]
        df[weights_name] = df[weights].where(~df[prod_name].isnull())
        df = df.groupby(*groupby_args, **groupby_kwargs).sum()
        s = df[prod_name] / df[weights_name]
        s.name = value_col
        ss.append(s)
    df = pd.concat(ss, axis=1) if len(ss) > 1 else ss[0]
    return df

pandas.DataFrame.grouped_weighted_average = grouped_weighted_average
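With the indentation above fixed, the patched method can be exercised end to end; the c/w/key column names and values below are made up for the demo, and the docstring is omitted for brevity:

```python
import pandas as pd

def grouped_weighted_average(self, values, weights, *groupby_args, **groupby_kwargs):
    if isinstance(values, str):
        values = [values]
    ss = []
    for value_col in values:
        df = self.copy()
        prod_name = 'prod_{v}_{w}'.format(v=value_col, w=weights)
        weights_name = 'weights_{w}'.format(w=weights)
        # precompute products, then a single groupby-sum does the rest
        df[prod_name] = df[value_col] * df[weights]
        df[weights_name] = df[weights].where(~df[prod_name].isnull())
        df = df.groupby(*groupby_args, **groupby_kwargs).sum()
        s = df[prod_name] / df[weights_name]
        s.name = value_col
        ss.append(s)
    return pd.concat(ss, axis=1) if len(ss) > 1 else ss[0]

pd.DataFrame.grouped_weighted_average = grouped_weighted_average

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'c': [1.0, 3.0, 2.0, 4.0],
                   'w': [1.0, 3.0, 1.0, 1.0]})
out = df.grouped_weighted_average('c', 'w', 'key')
# a: (1*1 + 3*3) / 4 = 2.5;  b: (2 + 4) / 2 = 3.0
```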

Here's a solution which has the following benefits:

  1. You don't need to define a function in advance
  2. You can use it within a pipe (since it uses a lambda)
  3. You can name the resulting column

(df.groupby('group')
   .apply(lambda x: pd.Series({
       'weighted_average': np.average(x.data, weights=x.weights)})))

You can also use the same code to perform multiple aggregations:

(df.groupby('group')
   .apply(lambda x: pd.Series({
       'weighted_average': np.average(x.data, weights=x.weights),
       'regular_average': np.average(x.data)})))

You can implement this function in the following way:

(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()

For example:

df = pd.DataFrame({'groups': [1, 1, 2, 2], 'c': [3, 3, 4, 4], 'w': [5, 5, 6, 6]})
(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()

Result:

groups
1    3.0
2    4.0
dtype: float64

Adding to Wes McKinney's answer, this will rename the aggregated column:

grouped = df.groupby(keys)

def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

grouped.apply(wavg).reset_index().rename(columns={0 : "wavg"})
