Groupby with User Defined Functions in Pandas

Question

I understand that passing a function as a group key calls the function once per index value, with the return values being used as the group names. What I can't figure out is how to call the function on column values.

So I can do this:

import numpy as np
import pandas as pd

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

def GroupFunc(x):
    # x is a single index label (a name); group by its length
    if len(x) > 3:
        return 'Group1'
    else:
        return 'Group2'

people.groupby(GroupFunc).sum()

This splits the data into two groups: one whose index values have length 3 or less, and one whose index values are longer than 3. But how can I pass one of the column values instead? For example, grouping on whether the value in column a is greater than 1 for each index entry. I realise I could just do the following:

people.groupby(people.a > 1).sum()

But I want to know how to do this in a user defined function for future reference.

Something like:

def GroupColFunc(x):
    if x > 1:
        return 'Group1'
    else:
        return 'Group2'

But how do I call this? I tried

people.groupby(GroupColFunc(people.a))

and similar variants, but this does not work.

How do I pass the column values to the function? And how would I pass multiple column values, e.g. to group on whether people.a > people.b?

Answer 1

To group by a > 1, you can define your function like this:

>>> def GroupColFunc(df, ind, col):
...     if df[col].loc[ind] > 1:
...         return 'Group1'
...     else:
...         return 'Group2'
...

And then call it like this:

>>> people.groupby(lambda x: GroupColFunc(people, x, 'a')).sum()
               a         b         c         d        e
Group2 -2.384614 -0.762208  3.359299 -1.574938 -2.65963
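
For reference, the attempt in the question fails because GroupColFunc receives the whole column at once, and `if x > 1` cannot turn a boolean Series into a single True/False. A minimal sketch of one way around that, assuming the scalar function from the question is kept unchanged: map it over the column and group by the resulting labels. The data here is hypothetical and fixed (the original uses random values), and the names labels and result are only for illustration.

```python
import pandas as pd

# Hypothetical fixed data so the grouping is reproducible.
people = pd.DataFrame(
    {'a': [1.5, -0.2, 2.0, 0.3, -1.0],
     'b': [0.5, 0.1, 3.0, -0.3, 1.0]},
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

def GroupColFunc(x):
    # x is a single scalar value taken from the column
    if x > 1:
        return 'Group1'
    else:
        return 'Group2'

# GroupColFunc(people['a']) would raise ValueError here, because the
# `if` statement gets a whole boolean Series instead of one value.
# Series.map calls the function once per value and returns a Series of
# labels that shares the index of `people`.
labels = people['a'].map(GroupColFunc)
result = people.groupby(labels).sum()
print(result)
```

Because the labels Series is aligned on the same index as the frame, groupby can use it directly as the key.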

Or you can do it with just an anonymous function:

>>> people.groupby(lambda x: 'Group1' if people['b'].loc[x] > people['a'].loc[x] else 'Group2').sum()
               a         b         c         d         e
Group1 -3.280319 -0.007196  1.525356  0.324154 -1.002439
Group2  0.895705 -0.755012  1.833943 -1.899092 -1.657191
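
If a named function that sees several columns is preferred over the lambda, a row-wise apply is another option. This is only a sketch on hypothetical fixed data: GroupRowFunc, row_labels and row_result are illustrative names, and apply with axis=1 is typically slower than a vectorised comparison on large frames.

```python
import pandas as pd

# Hypothetical fixed data so the grouping is reproducible.
people = pd.DataFrame(
    {'a': [1.5, -0.2, 2.0, 0.3, -1.0],
     'b': [0.5, 0.1, 3.0, -0.3, 1.0]},
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

def GroupRowFunc(row):
    # row is one row of the frame as a Series, so every column is available
    if row['b'] > row['a']:
        return 'Group1'
    else:
        return 'Group2'

# axis=1 passes each row to the function; the result is a Series of
# labels indexed like `people`, which groupby accepts as a key.
row_labels = people.apply(GroupRowFunc, axis=1)
row_result = people.groupby(row_labels).sum()
print(row_result)
```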

As the documentation says, you can also group by passing a Series that provides a label -> group name mapping:

>>> mapping = np.where(people['b'] > people['a'], 'Group1', 'Group2')
>>> mapping
Joe       Group2
Steve     Group1
Wes       Group2
Jim       Group1
Travis    Group1
dtype: string48
>>> people.groupby(mapping).sum()
               a         b         c         d         e
Group1 -3.280319 -0.007196  1.525356  0.324154 -1.002439
Group2  0.895705 -0.755012  1.833943 -1.899092 -1.657191
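
One caveat when trying this today: in current pandas releases np.where returns a plain NumPy array rather than a Series, so the printed mapping will look different from the output above. groupby accepts an array of the same length as the frame, so the call should still work. If an actual label -> group name Series is wanted, the array can be wrapped explicitly. A small sketch on hypothetical fixed data (raw, by_series and by_array are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the grouping is reproducible.
people = pd.DataFrame(
    {'a': [1.5, -0.2, 2.0, 0.3, -1.0],
     'b': [0.5, 0.1, 3.0, -0.3, 1.0]},
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# Vectorised comparison of two columns; the result is a NumPy array
# with one group name per row, in row order.
raw = np.where(people['b'] > people['a'], 'Group1', 'Group2')

# Wrapping the array gives a Series keyed by the same index labels.
mapping = pd.Series(raw, index=people.index)

by_series = people.groupby(mapping).sum()  # key matched by index label
by_array = people.groupby(raw).sum()       # key matched by position
print(by_series)
```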

Answer 1: score 33, accepted, 2013-10-27 08:28:57
