Groupby() and aggregation in pandas
I have a pd.DataFrame that looks like this:
In [149]: df
Out[149]:
AMOUNT DATE ORDER_ID UID
0 1001 2014-01-02 101 1
1 1002 2014-01-03 102 3
2 1003 2014-01-04 103 4
3 1004 2014-01-05 104 5
4 1005 2014-01-09 105 5
5 1006 2014-01-07 106 7
6 1007 2014-01-08 107 8
7 1008 2014-01-09 108 5
8 1009 2014-01-10 109 10
9 1500 2014-01-09 110 5
and I want to collapse all rows that share the same UID and DATE into one row, using the sum of their AMOUNT values for the one row that remains.
In short, the desired output would be:
In [149]: df
Out[149]:
AMOUNT DATE ORDER_ID UID
0 1001 2014-01-02 101 1
1 1002 2014-01-03 102 3
2 1003 2014-01-04 103 4
3 1004 2014-01-05 104 5
4 3513 2014-01-09 105 5 ## <- Rows that previously had index [7,9,4] are now truncated to this one row and the AMOUNT is the sum of the AMOUNT values of those three rows
5 1006 2014-01-07 106 7
6 1007 2014-01-08 107 8
8 1009 2014-01-10 109 10
In essence, what I want to do is 'aggregate' all rows that correspond to the same user UID and DATE into one row and leave all other rows intact.
What I've tried so far is this:
In [154]: df.groupby(['UID','DATE'])['AMOUNT'].sum()
Out[154]:
UID DATE
1 2014-01-02 1001
3 2014-01-03 1002
4 2014-01-04 1003
5 2014-01-05 1004
2014-01-09 3513
7 2014-01-07 1006
8 2014-01-08 1007
10 2014-01-10 1009
Name: AMOUNT, dtype: int64
but I'm not sure how to go back to the original df and remove the 'extra' rows, or how to assign the new summed AMOUNT value to the one remaining row.
Any help is much appreciated!
df['AMOUNT'] = df.groupby(['UID','DATE'])['AMOUNT'].transform('sum')
df = df.drop_duplicates(['UID', 'DATE'])
df
Out[21]:
AMOUNT DATE ORDER_ID UID
0 1001 2014-01-02 101 1
1 1002 2014-01-03 102 3
2 1003 2014-01-04 103 4
3 1004 2014-01-05 104 5
4 3513 2014-01-09 105 5
5 1006 2014-01-07 106 7
6 1007 2014-01-08 107 8
8 1009 2014-01-10 109 10
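For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same transform-then-drop_duplicates idea. The DataFrame is rebuilt by hand from the table in the question (DATE is kept as plain strings, which is an assumption; the question does not show the dtype), and the name `collapsed` is only illustrative.

```python
import pandas as pd

# Rebuilt from the table in the question; DATE is assumed to be plain strings.
df = pd.DataFrame({
    "AMOUNT": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1500],
    "DATE": ["2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-09",
             "2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-09"],
    "ORDER_ID": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "UID": [1, 3, 4, 5, 5, 7, 8, 5, 10, 5],
})

# transform('sum') returns a Series aligned to the original index, so every row
# of a (UID, DATE) group receives that group's total.
group_total = df.groupby(["UID", "DATE"])["AMOUNT"].transform("sum")

# drop_duplicates keeps the first row of each (UID, DATE) pair by default,
# so the original index labels and row order survive.
collapsed = df.assign(AMOUNT=group_total).drop_duplicates(["UID", "DATE"])
print(collapsed)
```

Because the group total is written onto every row before the duplicates are dropped, the surviving row keeps its original index label (4 here) and its original ORDER_ID.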
I think you can aggregate with sum and first:
print(df.groupby(['UID','DATE'], as_index=False).agg({'AMOUNT': 'sum', 'ORDER_ID': 'first'}))
UID DATE AMOUNT ORDER_ID
0 1 2014-01-02 1001 101
1 3 2014-01-03 1002 102
2 4 2014-01-04 1003 103
3 5 2014-01-05 1004 104
4 5 2014-01-09 3513 105
5 7 2014-01-07 1006 106
6 8 2014-01-08 1007 107
7 10 2014-01-10 1009 109
Alternatively you can use aggregate:
In [10]: df.groupby(['UID', 'DATE']).agg({'AMOUNT': np.sum, 'ORDER_ID': lambda x: x.iloc[0]}).reset_index()
Out[10]:
UID DATE AMOUNT ORDER_ID
0 1 2014-01-02 1001 101
1 3 2014-01-03 1002 102
2 4 2014-01-04 1003 103
3 5 2014-01-05 1004 104
4 5 2014-01-09 3513 105
5 7 2014-01-07 1006 106
6 8 2014-01-08 1007 107
7 10 2014-01-10 1009 109
This assumes you only want the "first" ORDER_ID, as in your expected output, i.e. lambda x: x.iloc[0].
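The same sum-and-first aggregation can also be spelled with pandas' named aggregation syntax, which avoids the lambda. This is only a sketch under the same assumptions as the question's data (the frame is rebuilt by hand, DATE is kept as strings, and `out` is an illustrative name), not a claim about what the answers above intended.

```python
import pandas as pd

# Rebuilt from the table in the question; DATE is assumed to be plain strings.
df = pd.DataFrame({
    "AMOUNT": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1500],
    "DATE": ["2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-09",
             "2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-09"],
    "ORDER_ID": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "UID": [1, 3, 4, 5, 5, 7, 8, 5, 10, 5],
})

# Named aggregation: output_column=(input_column, aggregation).
# 'first' takes the first ORDER_ID of each group in original row order.
out = df.groupby(["UID", "DATE"], as_index=False).agg(
    AMOUNT=("AMOUNT", "sum"),
    ORDER_ID=("ORDER_ID", "first"),
)
print(out)
```

Unlike the transform-and-drop_duplicates approach, this returns a new frame with a fresh 0..7 index, sorted by the group keys, and containing only the grouped and aggregated columns.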