Keep One Column but Using Other Columns in Pandas Groupby and Agg

My goal is to group a dataset by a certain column (identifier) and then perform some custom operations within each group (first sort by date, then concatenate the status values).

Here is what I have done so far.

import pandas as pd
from io import StringIO
text = """date  identifier  status
1/1/18  A   Pending
1/1/18  B   Pending
1/1/18  C   Pending
1/2/18  A   Approve
1/2/18  B   Pending
1/2/18  C   Pending
1/3/18  B   Approve
1/3/18  C   Pending"""
text = StringIO(text)
# the columns are separated by runs of whitespace, so split on \s+
df = pd.read_csv(text, sep=r"\s+")

# group by identifier 
# within the group, sort by date
# then concatenate by status

def myfunc(df):
    df.sort_values(by="date", ascending=True)
    res = [s[0] for s in df['status']]
    return ''.join(res)

df.groupby(['identifier']).agg(lambda x: myfunc(x))

id  date  status        
A   PA  PA
B   PPA PPA
C   PPP PPP

It seems that agg applies the lambda function to each column, and for each column the whole group is visible, so both status and date appear in the final result with the same output. I can drop the date column afterwards, but that does not seem ideal.

I tried specifying the status column, but then the function loses visibility of the other columns I want to use (for sorting).

def myfunc1(x):
    print(x)

df.groupby(['identifier']).agg({'status': lambda x: myfunc1(x)}) 
0    Pending
3    Approve
Name: status, dtype: object
1    Pending
4    Pending
6    Approve
Name: status, dtype: object
2    Pending
5    Pending
7    Pending
Name: status, dtype: object

In summary, how should I use the agg function properly to get the following final outcome?

id   status        
A    PA
B    PPA
C    PPP

IIUC, you can slice out the first letter of status first and then just agg:

df['letter'] = df.status.str[0]
df.groupby('identifier').letter.agg(''.join)

identifier
A     PA
B    PPA
C    PPP
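The snippet above relies on the rows already being in date order. A minimal sketch of the same slice-then-agg idea with an explicit sort is shown below. It assumes the dates are month/day/two-digit-year and parses them with pd.to_datetime so the sort is chronological rather than alphabetical. The rows are the ones from the question, deliberately shuffled so the sort matters.

```python
import pandas as pd

# Rows from the question, shuffled on purpose so that sorting has an effect.
df = pd.DataFrame({
    "date": ["1/3/18", "1/1/18", "1/2/18", "1/1/18",
             "1/2/18", "1/3/18", "1/1/18", "1/2/18"],
    "identifier": ["B", "A", "A", "B", "B", "C", "C", "C"],
    "status": ["Approve", "Pending", "Approve", "Pending",
               "Pending", "Pending", "Pending", "Pending"],
})

# Assumption: dates are month/day/two-digit-year.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")

# Sort once up front; groupby keeps the row order within each group.
ordered = df.sort_values("date")

# Take the first letter of each status, then join the letters per identifier.
result = ordered["status"].str[0].groupby(ordered["identifier"]).agg("".join)
print(result)
```

Only the status column reaches agg here, so no date column shows up in the result and nothing has to be dropped afterwards.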

But if you really want to use your myfunc, you can correct it by doing the following:

  1. Assign back the result of sort_values (or remove the call entirely): as written, you sort but never use the return value of sort_values, so nothing actually changes. (I believe you should call sort_values before groupby and agg, not inside the agg function.)

  2. Specify that you want to aggregate only the status column, not all columns. You can do that in two ways, as shown below.

The code would look like:

def myfunc(ser):
    res = [s[0] for s in ser]
    return ''.join(res)

df = df.sort_values('date', ascending=True)
df.groupby(['identifier']).agg({'status': lambda x: myfunc(x)})

or

df.groupby(['identifier']).status.agg(lambda x: myfunc(x))

#same as 
df.groupby(['identifier']).status.agg(myfunc) 
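If the sort really has to happen inside the per-group function, one alternative (not part of the answers above) is groupby(...).apply, which hands the function a sub-DataFrame rather than a single column. The sketch below is only an illustration under stated assumptions: concat_initials is a hypothetical helper name, and the dates are assumed to be month/day/two-digit-year.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["1/1/18", "1/1/18", "1/1/18", "1/2/18",
             "1/2/18", "1/2/18", "1/3/18", "1/3/18"],
    "identifier": ["A", "B", "C", "A", "B", "C", "B", "C"],
    "status": ["Pending", "Pending", "Pending", "Approve",
               "Pending", "Pending", "Approve", "Pending"],
})

# Assumption: dates are month/day/two-digit-year.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")

def concat_initials(group):
    """Hypothetical helper: sort one group's rows by date, join status initials."""
    # sort_values returns a new DataFrame, so the result must be kept.
    ordered = group.sort_values("date")
    return "".join(ordered["status"].str[0])

# Select only the columns the helper needs; apply passes them as a DataFrame.
result = df.groupby("identifier")[["date", "status"]].apply(concat_initials)
print(result)
```

Because the helper returns a single string per group, the result is a Series indexed by identifier, with date used only for ordering. Sorting once before the groupby, as in the answer above, is usually simpler and avoids re-sorting every group.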
