I want to group a dataset by a certain column (identifier) and then perform some custom operations within each group (first sort by date, then concatenate the status).
Here is what I have done so far.
import pandas as pd
from io import StringIO
text = """date identifier status
1/1/18 A Pending
1/1/18 B Pending
1/1/18 C Pending
1/2/18 A Approve
1/2/18 B Pending
1/2/18 C Pending
1/3/18 B Approve
1/3/18 C Pending"""
text = StringIO(text)
df = pd.read_csv(text, sep=r"\s+")  # columns are whitespace-separated; sep="\t" only works if they are tab-delimited
# group by identifier
# within the group, sort by date
# then concatenate by status
def myfunc(df):
    df.sort_values(by="date", ascending=True)
    res = [s[0] for s in df['status']]
    return ''.join(res)
df.groupby(['identifier']).agg(lambda x: myfunc(x))
id date status
A PA PA
B PPA PPA
C PPP PPP
It seems that agg applies the lambda function to each column, and each call sees the whole group, which leads to both status and date being present in the final outcome with the same output. I could drop the date column afterwards, but that does not seem ideal.
I also tried specifying only the status column, but then the function loses visibility of the other columns it needs (here, date for sorting).
def myfunc1(x):
    print(x)
df.groupby(['identifier']).agg({'status': lambda x: myfunc1(x)})
0 Pending
3 Approve
Name: status, dtype: object
1 Pending
4 Pending
6 Approve
Name: status, dtype: object
2 Pending
5 Pending
7 Pending
Name: status, dtype: object
In summary, how should I use the agg function properly to get the following final outcome?
id status
A PA
B PPA
C PPP
IIUC, you can slice out the first letter of each status first and then just agg:
df['letter'] = df.status.str[0]
df.groupby('identifier').letter.agg(''.join)
identifier
A PA
B PPA
C PPP
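The snippet above relies on the rows already being in date order. As a minimal sketch of the same idea with an explicit sort, the block below parses the dates first so the ordering is chronological rather than string-based. The `%m/%d/%y` format (month/day/two-digit year) and the whitespace separator are assumptions about the sample data, not something stated in the question.

```python
from io import StringIO

import pandas as pd

text = """date identifier status
1/1/18 A Pending
1/1/18 B Pending
1/1/18 C Pending
1/2/18 A Approve
1/2/18 B Pending
1/2/18 C Pending
1/3/18 B Approve
1/3/18 C Pending"""

# Assumption: columns are separated by whitespace
df = pd.read_csv(StringIO(text), sep=r"\s+")

# Assumption: dates are month/day/two-digit year; parsing them makes
# the sort chronological instead of lexicographic on strings
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")

result = (
    df.sort_values("date", kind="mergesort")           # stable sort by date
      .assign(letter=lambda d: d["status"].str[0])     # first letter of each status
      .groupby("identifier")["letter"]                 # row order is kept within each group
      .agg("".join)                                    # concatenate letters per identifier
)
print(result)
```

Because groupby keeps the original row order inside each group, sorting once before grouping is enough here.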
But if you really want to use your myfunc, you can correct it by doing two things:
1. Assign back the result of sort_values (or remove the call entirely). The way it is now, you are sorting but not using the return value of sort_values, so nothing is actually being done. (I believe you should call sort_values before groupby and agg, not inside the agg function.)
2. Specify that you want to agg only the status column, not all columns. You can do that in two ways, as shown below.
The code would look like:
def myfunc(ser):
    res = [s[0] for s in ser]
    return ''.join(res)
df = df.sort_values('date', ascending=True)
df.groupby(['identifier']).agg({'status': lambda x: myfunc(x)})
or
df.groupby(['identifier']).status.agg(lambda x: myfunc(x))
#same as
df.groupby(['identifier']).status.agg(myfunc)
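If the sort really has to happen inside the per-group function, one alternative to agg is groupby(...).apply, which passes each group as a DataFrame so that date and status are both visible. This is a different technique from the agg calls above, shown only as a sketch; the helper name concat_status, the `%m/%d/%y` date format, and the deliberately reversed row order are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd

text = """date identifier status
1/1/18 A Pending
1/1/18 B Pending
1/1/18 C Pending
1/2/18 A Approve
1/2/18 B Pending
1/2/18 C Pending
1/3/18 B Approve
1/3/18 C Pending"""

# Assumption: whitespace-separated columns, month/day/two-digit-year dates
df = pd.read_csv(StringIO(text), sep=r"\s+")
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")

# Reverse the rows on purpose so the in-group sort has something to do
df = df.iloc[::-1]


def concat_status(group: pd.DataFrame) -> str:
    """Sort one group by date, then join the first letter of each status."""
    ordered = group.sort_values("date", kind="mergesort")
    return "".join(ordered["status"].str[0])


# Select the needed columns explicitly so the function receives only
# date and status for each identifier
result = df.groupby("identifier")[["date", "status"]].apply(concat_status)
print(result)
```

Sorting once before the groupby, as in the agg version, is usually simpler; apply is mainly useful when the per-group logic needs several columns at once.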