
Groupby() and aggregation in pandas

I have a pd.DataFrame that looks like this:

In [149]: df
Out[149]: 
   AMOUNT       DATE  ORDER_ID  UID
0    1001 2014-01-02       101    1
1    1002 2014-01-03       102    3
2    1003 2014-01-04       103    4
3    1004 2014-01-05       104    5
4    1005 2014-01-09       105    5
5    1006 2014-01-07       106    7
6    1007 2014-01-08       107    8
7    1008 2014-01-09       108    5
8    1009 2014-01-10       109   10
9    1500 2014-01-09       110    5
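For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame. It assumes DATE is a datetime64 column (the printed output does not show the dtype) and that the other columns are plain integers; the name `dupes` is only for illustration.

```python
import pandas as pd

# Rebuild the example frame shown above; DATE is parsed to datetime64 (an assumption).
df = pd.DataFrame({
    "AMOUNT": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1500],
    "DATE": pd.to_datetime([
        "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-09",
        "2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-09",
    ]),
    "ORDER_ID": list(range(101, 111)),
    "UID": [1, 3, 4, 5, 5, 7, 8, 5, 10, 5],
})

# Rows that share a (UID, DATE) pair with at least one other row,
# i.e. the rows that should end up merged into one.
dupes = df[df.duplicated(["UID", "DATE"], keep=False)]
print(dupes)
```

Only the three rows for UID 5 on 2014-01-09 show up in `dupes`; every other (UID, DATE) pair is already unique.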

and I want to collapse all rows that share the same UID and DATE into a single row, using the sum of their AMOUNT values for the row that remains.

In short, the desired output would be:

In [149]: df
Out[149]: 
   AMOUNT       DATE  ORDER_ID  UID
0    1001 2014-01-02       101    1
1    1002 2014-01-03       102    3
2    1003 2014-01-04       103    4
3    1004 2014-01-05       104    5
4    3513 2014-01-09       105    5  ## <- Rows that previously had index [7,9,4] are now collapsed into this one row, and AMOUNT is the sum of the AMOUNT values of those three rows
5    1006 2014-01-07       106    7
6    1007 2014-01-08       107    8
8    1009 2014-01-10       109   10

In essence, I want to 'aggregate' all rows that correspond to the same user UID and DATE into one row and leave all other rows intact.

What I've tried so far is this:

In [154]: df.groupby(['UID','DATE'])['AMOUNT'].sum()
Out[154]: 
UID  DATE      
1    2014-01-02    1001
3    2014-01-03    1002
4    2014-01-04    1003
5    2014-01-05    1004
     2014-01-09    3513
7    2014-01-07    1006
8    2014-01-08    1007
10   2014-01-10    1009
Name: AMOUNT, dtype: int64

but I'm not sure how to go back to the original df and remove the 'extra' rows, or how to assign the new summed AMOUNT to the one remaining row.

Any help is much appreciated!

df['AMOUNT'] = df.groupby(['UID','DATE'])['AMOUNT'].transform('sum')
df = df.drop_duplicates(['UID', 'DATE'])
df
Out[21]: 
   AMOUNT       DATE  ORDER_ID  UID
0    1001 2014-01-02       101    1
1    1002 2014-01-03       102    3
2    1003 2014-01-04       103    4
3    1004 2014-01-05       104    5
4    3513 2014-01-09       105    5
5    1006 2014-01-07       106    7
6    1007 2014-01-08       107    8
8    1009 2014-01-10       109   10
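A sketch of why this works, run on a rebuilt copy of the sample data (the frame construction and the helper name `collapse_duplicates` are assumptions for illustration). `transform('sum')` returns a Series aligned to the original index, with every row holding its group's total, so `drop_duplicates`, which keeps the first occurrence by default, leaves one row per (UID, DATE) carrying the summed AMOUNT. Working on a copy leaves the original frame untouched.

```python
import pandas as pd

# Sample data rebuilt from the question (DATE as datetime64 is an assumption).
df = pd.DataFrame({
    "AMOUNT": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1500],
    "DATE": pd.to_datetime([
        "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-09",
        "2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-09",
    ]),
    "ORDER_ID": list(range(101, 111)),
    "UID": [1, 3, 4, 5, 5, 7, 8, 5, 10, 5],
})


def collapse_duplicates(frame):
    """Hypothetical helper: one row per (UID, DATE), AMOUNT summed per group."""
    out = frame.copy()  # do not modify the caller's frame
    # Every row gets the total AMOUNT of its (UID, DATE) group.
    out["AMOUNT"] = out.groupby(["UID", "DATE"])["AMOUNT"].transform("sum")
    # Keep only the first row of each group; its original index label survives.
    return out.drop_duplicates(["UID", "DATE"])


result = collapse_duplicates(df)
print(result)
```

Because the first row of each group is the one kept, the surviving row for UID 5 on 2014-01-09 keeps index 4 and ORDER_ID 105, which matches the output requested in the question.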

I think you can aggregate with sum and first:

print(df.groupby(['UID','DATE'], as_index=False).agg({'AMOUNT': 'sum', 'ORDER_ID': 'first'}))

   UID        DATE  AMOUNT  ORDER_ID
0    1  2014-01-02    1001       101
1    3  2014-01-03    1002       102
2    4  2014-01-04    1003       103
3    5  2014-01-05    1004       104
4    5  2014-01-09    3513       105
5    7  2014-01-07    1006       106
6    8  2014-01-08    1007       107
7   10  2014-01-10    1009       109
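The same aggregation can also be written with named aggregation, which is available in pandas 0.25 and later. This is a sketch on a rebuilt copy of the sample data (the frame construction is an assumption); the final column selection only restores the column order used in the question.

```python
import pandas as pd

# Sample data rebuilt from the question (DATE as datetime64 is an assumption).
df = pd.DataFrame({
    "AMOUNT": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1500],
    "DATE": pd.to_datetime([
        "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-09",
        "2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-09",
    ]),
    "ORDER_ID": list(range(101, 111)),
    "UID": [1, 3, 4, 5, 5, 7, 8, 5, 10, 5],
})

# Named aggregation: output_column=(input_column, aggregation).
result = (
    df.groupby(["UID", "DATE"], as_index=False)
      .agg(AMOUNT=("AMOUNT", "sum"), ORDER_ID=("ORDER_ID", "first"))
)

# Restore the column order of the original frame.
result = result[["AMOUNT", "DATE", "ORDER_ID", "UID"]]
print(result)
```

Note that, unlike the transform/drop_duplicates approach, groupby produces a fresh 0..7 index and sorts the rows by UID and DATE.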

Alternatively, you can use aggregate:

In [10]: df.groupby(['UID', 'DATE']).agg({'AMOUNT': np.sum, 'ORDER_ID': lambda x: x.iloc[0]}).reset_index()
Out[10]: 
   UID       DATE  AMOUNT  ORDER_ID
0    1 2014-01-02    1001       101
1    3 2014-01-03    1002       102
2    4 2014-01-04    1003       103
3    5 2014-01-05    1004       104
4    5 2014-01-09    3513       105
5    7 2014-01-07    1006       106
6    8 2014-01-08    1007       107
7   10 2014-01-10    1009       109

This assumes you only want the "first" ORDER_ID of each group, as in your expected output, i.e. lambda x: x.iloc[0].
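One difference between `lambda x: x.iloc[0]` and the built-in 'first' aggregation is worth knowing: 'first' returns the first non-null value in each group, while `iloc[0]` returns whatever sits in the first row. On the data in the question the two agree, since no ORDER_ID is missing. The sketch below uses made-up data with a missing value purely to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not from the question): the first ORDER_ID in the group is missing.
df = pd.DataFrame({
    "UID": [5, 5],
    "ORDER_ID": [np.nan, 108.0],
})

grouped = df.groupby("UID")["ORDER_ID"]

by_first = grouped.first()                  # first non-null value per group
by_iloc = grouped.agg(lambda x: x.iloc[0])  # first row per group, NaN included

print(by_first)
print(by_iloc)
```

Which one is preferable depends on whether a missing first value should be skipped or preserved.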
