
Groupby() and aggregation in pandas

I have a pd.DataFrame that looks like this:

In [149]: df
Out[149]: 
   AMOUNT       DATE  ORDER_ID  UID
0    1001 2014-01-02       101    1
1    1002 2014-01-03       102    3
2    1003 2014-01-04       103    4
3    1004 2014-01-05       104    5
4    1005 2014-01-09       105    5
5    1006 2014-01-07       106    7
6    1007 2014-01-08       107    8
7    1008 2014-01-09       108    5
8    1009 2014-01-10       109   10
9    1500 2014-01-09       110    5

and I want to collapse all rows that share the same UID and DATE into a single row, using the sum of their AMOUNT values for the one row that remains.

In short, the desired output would be:

In [149]: df
Out[149]: 
   AMOUNT       DATE  ORDER_ID  UID
0    1001 2014-01-02       101    1
1    1002 2014-01-03       102    3
2    1003 2014-01-04       103    4
3    1004 2014-01-05       104    5
4    3513 2014-01-09       105    5   ## <- Rows that previously had index [7,9,4] are now collapsed into this one row, and AMOUNT is the sum of the AMOUNT values of those three rows
5    1006 2014-01-07       106    7
6    1007 2014-01-08       107    8
8    1009 2014-01-10       109   10

In essence, what I want to do is 'aggregate' all rows that correspond to the same user UID and DATE to one row and leave all other rows intact.

What I've tried so far is this:

In [154]: df.groupby(['UID','DATE'])['AMOUNT'].sum()
Out[154]: 
UID  DATE      
1    2014-01-02    1001
3    2014-01-03    1002
4    2014-01-04    1003
5    2014-01-05    1004
     2014-01-09    3513
7    2014-01-07    1006
8    2014-01-08    1007
10   2014-01-10    1009
Name: AMOUNT, dtype: int64

but I'm not sure how to go back to the original df and remove the 'extra' rows, or how to assign the new AMOUNT sum to the one remaining row.

Any help is much appreciated!

df['AMOUNT'] = df.groupby(['UID','DATE'])['AMOUNT'].transform('sum')
df = df.drop_duplicates(['UID', 'DATE'])
df
Out[21]: 
   AMOUNT       DATE  ORDER_ID  UID
0    1001 2014-01-02       101    1
1    1002 2014-01-03       102    3
2    1003 2014-01-04       103    4
3    1004 2014-01-05       104    5
4    3513 2014-01-09       105    5
5    1006 2014-01-07       106    7
6    1007 2014-01-08       107    8
8    1009 2014-01-10       109   10
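
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The DataFrame constructor is an assumption (the question does not show how df was built); the values are copied from the question, and the name result is just illustrative.

```python
import pandas as pd

# Sample data copied from the question. How the original df was created is not
# shown there, so this constructor is an assumption.
df = pd.DataFrame({
    "AMOUNT": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1500],
    "DATE": pd.to_datetime([
        "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-09",
        "2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-09",
    ]),
    "ORDER_ID": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "UID": [1, 3, 4, 5, 5, 7, 8, 5, 10, 5],
})

# transform('sum') returns a Series aligned with the original rows, so every
# row in a (UID, DATE) group receives that group's total.
df["AMOUNT"] = df.groupby(["UID", "DATE"])["AMOUNT"].transform("sum")

# drop_duplicates keeps the first row of each (UID, DATE) pair by default,
# which preserves the first ORDER_ID and the original index labels.
result = df.drop_duplicates(["UID", "DATE"])
print(result)
```

Because the duplicates are dropped rather than regrouped, the surviving rows keep their original order and index labels (0 to 6 and 8), which matches the desired output in the question.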

I think you can aggregate with sum and first:

print(df.groupby(['UID','DATE'], as_index=False).agg({'AMOUNT': 'sum', 'ORDER_ID': 'first'}))

   UID        DATE  AMOUNT  ORDER_ID
0    1  2014-01-02    1001       101
1    3  2014-01-03    1002       102
2    4  2014-01-04    1003       103
3    5  2014-01-05    1004       104
4    5  2014-01-09    3513       105
5    7  2014-01-07    1006       106
6    8  2014-01-08    1007       107
7   10  2014-01-10    1009       109

Alternatively, you can use aggregate:

In [10]: df.groupby(['UID', 'DATE']).agg({'AMOUNT': 'sum', 'ORDER_ID': lambda x: x.iloc[0]}).reset_index()
Out[10]: 
   UID       DATE  AMOUNT  ORDER_ID
0    1 2014-01-02    1001       101
1    3 2014-01-03    1002       102
2    4 2014-01-04    1003       103
3    5 2014-01-05    1004       104
4    5 2014-01-09    3513       105
5    7 2014-01-07    1006       106
6    8 2014-01-08    1007       107
7   10 2014-01-10    1009       109

This assumes you only want the first ORDER_ID in each group, as in your expected output, i.e. lambda x: x.iloc[0].
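
As a variation on the two agg answers above (not something either answer used), the same result can probably be written with named aggregation, which is available in pandas 0.25 and later. This is a sketch under that version assumption; the DataFrame constructor and the name result are illustrative, with values copied from the question.

```python
import pandas as pd

# Sample data copied from the question; the constructor itself is an assumption.
df = pd.DataFrame({
    "AMOUNT": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1500],
    "DATE": pd.to_datetime([
        "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-09",
        "2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-09",
    ]),
    "ORDER_ID": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "UID": [1, 3, 4, 5, 5, 7, 8, 5, 10, 5],
})

# Named aggregation: each keyword is an output column, given as
# (input column, aggregation). Requires pandas >= 0.25.
result = df.groupby(["UID", "DATE"], as_index=False).agg(
    AMOUNT=("AMOUNT", "sum"),       # total per (UID, DATE) group
    ORDER_ID=("ORDER_ID", "first"), # first ORDER_ID seen in each group
)
print(result)
```

Note that groupby sorts by the group keys and, with as_index=False, returns a fresh 0 to 7 index, so the last row is labelled 7 rather than 8 as in the desired output. If the original index labels matter, the transform plus drop_duplicates approach keeps them.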
