
Keep One Column but Using Other Columns in Pandas Groupby and Agg

I want to group a dataset by one column (identifier) and then run a custom operation within each group: first sort by date, then concatenate the status values.

Here is what I have done so far.

import pandas as pd
from io import StringIO
text = """date  identifier  status
1/1/18  A   Pending
1/1/18  B   Pending
1/1/18  C   Pending
1/2/18  A   Approve
1/2/18  B   Pending
1/2/18  C   Pending
1/3/18  B   Approve
1/3/18  C   Pending"""
text = StringIO(text)
# the sample above is whitespace-separated, so split on runs of whitespace
df = pd.read_csv(text, sep=r"\s+")

# group by identifier 
# within the group, sort by date
# then concatenate by status

def myfunc(df):
    df.sort_values(by="date", ascending=True)
    res = [s[0] for s in df['status']]
    return ''.join(res)

df.groupby(['identifier']).agg(lambda x: myfunc(x))

id  date  status        
A   PA  PA
B   PPA PPA
C   PPP PPP

It seems that agg applies the lambda to each column, and that the whole group is visible each time it is called, so both date and status appear in the result with the same values. I could drop the date column afterwards, but that does not seem ideal.
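To check what agg actually hands to the function, a minimal sketch can return the name of each input. This assumes a recent pandas release, where a plain callable passed to agg on a grouped DataFrame is called once per non-grouping column with a Series. Older releases may fall back to passing the whole sub-frame when the per-column call fails, which would be consistent with the duplicated output above. The small frame and the describe_input helper are illustrative only.

```python
import pandas as pd

# Tiny illustrative frame (not the full dataset from the question)
df = pd.DataFrame({
    "date": ["1/1/18", "1/2/18", "1/1/18"],
    "identifier": ["A", "A", "B"],
    "status": ["Pending", "Approve", "Pending"],
})

def describe_input(col):
    # Hypothetical helper: report the name of the Series that agg passes in
    return col.name

seen = df.groupby("identifier").agg(describe_input)
print(seen)
```

Each cell of the result holds the name of the column the function received, so a function written this way never sees date and status together.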

I also tried specifying only the status column, but then the function loses access to the other columns it needs (date, for sorting).

def myfunc1(x):
    print(x)

df.groupby(['identifier']).agg({'status': lambda x: myfunc1(x)}) 
0    Pending
3    Approve
Name: status, dtype: object
1    Pending
4    Pending
6    Approve
Name: status, dtype: object
2    Pending
5    Pending
7    Pending
Name: status, dtype: object

In summary, how should I use agg properly to get the following result?

id   status        
A    PA
B    PPA
C    PPP

IIUC, you can slice out the first letter of each status first and then just agg:

df['letter'] = df.status.str[0]
df.groupby('identifier').letter.agg(''.join)

identifier
A     PA
B    PPA
C    PPP
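One caveat, sketched below under the assumption that the dates are month/day/two-digit-year strings as in the question: sorting them as text compares characters, so "1/10/18" would sort before "1/2/18". Parsing with pd.to_datetime before sorting avoids that. The extra 1/10/18 row and the shuffled row order are made up for illustration.

```python
import pandas as pd

# Illustrative rows, deliberately out of order, with one two-digit day
df = pd.DataFrame({
    "date": ["1/10/18", "1/2/18", "1/1/18", "1/1/18", "1/2/18"],
    "identifier": ["A", "A", "A", "B", "B"],
    "status": ["Approve", "Pending", "Pending", "Pending", "Approve"],
})

# Parse the strings into real timestamps so the sort is chronological
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")
df["letter"] = df["status"].str[0]

# Sort once up front; groupby keeps the row order within each group
result = (
    df.sort_values("date")
      .groupby("identifier")["letter"]
      .agg("".join)
)
print(result)
```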

But if you really want to use your myfunc, you can fix it by:

  1. Assigning the result of sort_values back (or removing the call entirely). As written, you sort but discard the return value of sort_values, which returns a new DataFrame rather than sorting in place, so nothing actually happens. (I believe you should call sort_values before groupby and agg, not inside the agg function.)

  2. Specifying that you want to agg only the status column, not all columns. You can do that in two ways, as shown below.

Code would go like:

def myfunc(ser):
    res = [s[0] for s in ser]
    return ''.join(res)

df = df.sort_values('date', ascending=True)
df.groupby(['identifier']).agg({'status': lambda x: myfunc(x)})

or

df.groupby(['identifier']).status.agg(lambda x: myfunc(x))

#same as 
df.groupby(['identifier']).status.agg(myfunc) 
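If the sort has to happen inside the per-group function, groupby.apply is one option, since it passes each group as a DataFrame rather than column by column. The following is a sketch, not the method used above: concat_status is a hypothetical helper, the rows are deliberately shuffled, and the columns are selected explicitly so the function sees only date and status.

```python
import pandas as pd

# Illustrative rows, deliberately out of chronological order
df = pd.DataFrame({
    "date": ["1/2/18", "1/1/18", "1/3/18", "1/1/18", "1/2/18"],
    "identifier": ["A", "A", "B", "B", "B"],
    "status": ["Approve", "Pending", "Approve", "Pending", "Pending"],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")

def concat_status(group):
    # Hypothetical helper: sort one group's rows by date,
    # then join the first letter of each status
    ordered = group.sort_values("date")
    return "".join(s[0] for s in ordered["status"])

# Selecting the columns up front keeps the grouping key out of the sub-frame
result = df.groupby("identifier")[["date", "status"]].apply(concat_status)
print(result)
```

This trades some speed for convenience: apply calls the Python function once per group, so sorting once before groupby, as in the answer above, is usually the cheaper route on large frames.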
