I have a pd.DataFrame that looks like this:
In [149]: df
Out[149]:
AMOUNT DATE ORDER_ID UID
0 1001 2014-01-02 101 1
1 1002 2014-01-03 102 3
2 1003 2014-01-04 103 4
3 1004 2014-01-05 104 5
4 1005 2014-01-09 105 5
5 1006 2014-01-07 106 7
6 1007 2014-01-08 107 8
7 1008 2014-01-09 108 5
8 1009 2014-01-10 109 10
9 1500 2014-01-09 110 5
I want to collapse all rows that share the same UID and DATE into a single row, and use the sum of their AMOUNT values for the one row that remains.
In short, the desired output would be:
In [149]: df
Out[149]:
AMOUNT DATE ORDER_ID UID
0 1001 2014-01-02 101 1
1 1002 2014-01-03 102 3
2 1003 2014-01-04 103 4
3 1004 2014-01-05 104 5
4 3513 2014-01-09 105 5 ## <- Rows that previously had index [7,9,4] are now collapsed into this one row, and its AMOUNT is the sum of the AMOUNT values of those three rows
5 1006 2014-01-07 106 7
6 1007 2014-01-08 107 8
8 1009 2014-01-10 109 10
In essence, what I want to do is 'aggregate' all rows that correspond to the same user UID and DATE to one row and leave all other rows intact.
What I've tried so far is this:
In [154]: df.groupby(['UID','DATE'])['AMOUNT'].sum()
Out[154]:
UID DATE
1 2014-01-02 1001
3 2014-01-03 1002
4 2014-01-04 1003
5 2014-01-05 1004
2014-01-09 3513
7 2014-01-07 1006
8 2014-01-08 1007
10 2014-01-10 1009
Name: AMOUNT, dtype: int64
but I'm not sure how to go back to the original df and remove the 'extra' rows, or how to assign the new summed AMOUNT value to the one remaining row.
Any help is much appreciated!
df['AMOUNT'] = df.groupby(['UID','DATE'])['AMOUNT'].transform('sum')
df = df.drop_duplicates(['UID', 'DATE'])
df
Out[21]:
AMOUNT DATE ORDER_ID UID
0 1001 2014-01-02 101 1
1 1002 2014-01-03 102 3
2 1003 2014-01-04 103 4
3 1004 2014-01-05 104 5
4 3513 2014-01-09 105 5
5 1006 2014-01-07 106 7
6 1007 2014-01-08 107 8
8 1009 2014-01-10 109 10
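For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same transform + drop_duplicates approach. The DataFrame is rebuilt by hand from the sample in the question, and the names df and out are only illustrative. It works on a copy so the original frame is left untouched.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "AMOUNT": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1500],
    "DATE": pd.to_datetime([
        "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-09",
        "2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-09",
    ]),
    "ORDER_ID": list(range(101, 111)),
    "UID": [1, 3, 4, 5, 5, 7, 8, 5, 10, 5],
})

out = df.copy()

# transform('sum') returns a Series aligned with the original rows, so every
# row receives the total AMOUNT of its (UID, DATE) group.
out["AMOUNT"] = out.groupby(["UID", "DATE"])["AMOUNT"].transform("sum")

# drop_duplicates keeps the first row of each (UID, DATE) pair by default,
# which is why ORDER_ID 105 and the original index 4 survive.
out = out.drop_duplicates(["UID", "DATE"])

print(out)
```

Because the rows are filtered rather than regrouped, the original index labels and column order are preserved, which matches the desired output in the question.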
I think you can aggregate with sum and first:
print(df.groupby(['UID','DATE'], as_index=False).agg({'AMOUNT': 'sum', 'ORDER_ID': 'first'}))
UID DATE AMOUNT ORDER_ID
0 1 2014-01-02 1001 101
1 3 2014-01-03 1002 102
2 4 2014-01-04 1003 103
3 5 2014-01-05 1004 104
4 5 2014-01-09 3513 105
5 7 2014-01-07 1006 106
6 8 2014-01-08 1007 107
7 10 2014-01-10 1009 109
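The same aggregation can also be written with named aggregation, which I believe is available from pandas 0.25 onwards. This is only a sketch of an alternative spelling, not something taken from the answer above. The sample frame is rebuilt from the question, and the final column selection is just there to restore the original column order.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "AMOUNT": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1500],
    "DATE": pd.to_datetime([
        "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-09",
        "2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-09",
    ]),
    "ORDER_ID": list(range(101, 111)),
    "UID": [1, 3, 4, 5, 5, 7, 8, 5, 10, 5],
})

# Named aggregation: output_column=(input_column, aggregation).
result = (
    df.groupby(["UID", "DATE"], as_index=False)
      .agg(AMOUNT=("AMOUNT", "sum"), ORDER_ID=("ORDER_ID", "first"))
)

# Put the columns back in the order used in the question.
result = result[["AMOUNT", "DATE", "ORDER_ID", "UID"]]

print(result)
```

Unlike the drop_duplicates approach, groupby sorts by the group keys and produces a fresh 0..7 index, so the original row labels are not kept.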
Alternatively, you can use aggregate:
In [10]: df.groupby(['UID', 'DATE']).agg({'AMOUNT': np.sum, 'ORDER_ID': lambda x: x.iloc[0]}).reset_index()
Out[10]:
UID DATE AMOUNT ORDER_ID
0 1 2014-01-02 1001 101
1 3 2014-01-03 1002 102
2 4 2014-01-04 1003 103
3 5 2014-01-05 1004 104
4 5 2014-01-09 3513 105
5 7 2014-01-07 1006 106
6 8 2014-01-08 1007 107
7 10 2014-01-10 1009 109
This assumes you only want the first ORDER_ID of each group, as in your expected output, i.e. lambda x: x.iloc[0].
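One caveat when choosing between 'first' and lambda x: x.iloc[0]: as far as I know, the groupby 'first' aggregation skips missing values by default, while x.iloc[0] returns whatever sits in the first row of the group. With the data in the question this makes no difference, since no ORDER_ID is missing. The tiny frame below is hypothetical and exists only to show the distinction.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not from the question): the first ORDER_ID of the
# group is missing.
df = pd.DataFrame({
    "UID": [5, 5],
    "DATE": ["2014-01-09", "2014-01-09"],
    "ORDER_ID": [np.nan, 108.0],
})

grouped = df.groupby(["UID", "DATE"])["ORDER_ID"]

# 'first' returns the first non-missing value in each group.
by_first = grouped.first()

# iloc[0] returns the value in the first row, missing or not.
by_iloc = grouped.agg(lambda x: x.iloc[0])

print(by_first)
print(by_iloc)
```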