Pandas groupby two columns get earliest date

This is the dataset:

import pandas as pd

data = {
    'id': ['1','1','1','1','2','2','2','2','2','3','3','3','3','3','3','3'],
    'status': ['Active','Active','Active','Pending Action','Pending Action','Pending Action','Active','Pending Action','Active','Draft','Active','Draft','Draft','Draft','Active','Draft'],
    'calc_date_id': ['05/07/2022','07/06/2022','31/08/2021','01/07/2021','20/11/2022','25/10/2022','02/04/2022','28/02/2022','01/07/2021','23/06/2022','15/06/2022','07/04/2022','09/11/2022','18/08/2020','19/03/2020','17/01/202'],
}

df = pd.DataFrame(data)
# Convert to datetime; the strings are day-first (dd/mm/yyyy)
df['calc_date_id'] = pd.to_datetime(df['calc_date_id'], dayfirst=True)
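The date strings in the dataset are day-first (for example 31/08/2021), so it helps to state the format when parsing. A minimal sketch on two of the values from the dataset; the names `raw` and `parsed` are only illustrative:

```python
import pandas as pd

# Two day-first strings taken from the dataset above.
raw = pd.Series(["05/07/2022", "31/08/2021"])

# An explicit format avoids any month/day ambiguity.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# → ['2022-07-05', '2021-08-31']
```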

For each id, how do I get the first date of the most recent status, that is, the earliest date since the status last changed?

I tried sorting by date and then keeping the first row per id and status (keep="first"), but I got:

[Image: grouping by status]

I also tried:

df.loc[df.groupby(['id', 'status'])['calc_date_id'].idxmin()]
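To see why this attempt does not give the wanted result, here is a small sketch using only the rows for id '2' from the dataset. Grouping by id and status together takes the minimum over the whole history of each status, not just over the most recent run. The name `per_status` is only illustrative.

```python
import pandas as pd

# Rows for id '2' only, copied from the dataset above.
df = pd.DataFrame({
    "id": ["2"] * 5,
    "status": ["Pending Action", "Pending Action", "Active",
               "Pending Action", "Active"],
    "calc_date_id": ["20/11/2022", "25/10/2022", "02/04/2022",
                     "28/02/2022", "01/07/2021"],
})
df["calc_date_id"] = pd.to_datetime(df["calc_date_id"], format="%d/%m/%Y")

# One row per (id, status): the earliest date that status was ever seen.
per_status = df.loc[df.groupby(["id", "status"])["calc_date_id"].idxmin()]
print(per_status)
```

For 'Pending Action' this picks 2022-02-28 (an older run), whereas the most recent 'Pending Action' run starts on 2022-10-25.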

Instead, I'd like to preserve the date order and get, for each id, only the first date of the status it changed to most recently (not the whole history).

This is the desired output:

I'm running out of ideas, so I'd appreciate any suggestions.

Thank you.

Try:

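# Rows are assumed to be ordered newest-first within each id, as in the sample.
# (x != x.shift(-1)).idxmax() is the last row of the first (most recent) status run,
# which holds the earliest date of that run.
df["desired_output"] = df.groupby("id")["status"].transform(
    lambda x: df.loc[x.index, "calc_date_id"][(x != x.shift(-1)).idxmax()]
)
print(df)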

Prints:

   id          status calc_date_id desired_output
0   1          Active   2022-07-05     2021-08-31
1   1          Active   2022-06-07     2021-08-31
2   1          Active   2021-08-31     2021-08-31
3   1  Pending Action   2021-07-01     2021-08-31
4   2  Pending Action   2022-11-20     2022-10-25
5   2  Pending Action   2022-10-25     2022-10-25
6   2          Active   2022-04-02     2022-10-25
7   2  Pending Action   2022-02-28     2022-10-25
8   2          Active   2021-07-01     2022-10-25
9   3           Draft   2022-06-23     2022-06-23
10  3          Active   2022-06-15     2022-06-23
11  3           Draft   2022-04-07     2022-06-23
12  3           Draft   2022-11-09     2022-06-23
13  3           Draft   2020-08-18     2022-06-23
14  3          Active   2020-03-19     2022-06-23
15  3           Draft   2020-01-17     2022-06-23
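The snippet above depends on the rows already being ordered newest-first within each id. If that is not guaranteed, one possible variant is to sort first and label status runs explicitly. This is only a sketch under that assumption, shown on the rows for ids '1' and '2'; the names `s`, `new_run`, `run_id`, `last_run` and `starts` are illustrative and not part of the answer.

```python
import pandas as pd

# Rows for ids '1' and '2', copied from the dataset above.
df = pd.DataFrame({
    "id": ["1"] * 4 + ["2"] * 5,
    "status": ["Active", "Active", "Active", "Pending Action",
               "Pending Action", "Pending Action", "Active",
               "Pending Action", "Active"],
    "calc_date_id": ["05/07/2022", "07/06/2022", "31/08/2021", "01/07/2021",
                     "20/11/2022", "25/10/2022", "02/04/2022",
                     "28/02/2022", "01/07/2021"],
})
df["calc_date_id"] = pd.to_datetime(df["calc_date_id"], format="%d/%m/%Y")

# Sort oldest-first within each id so runs can be read in time order.
s = df.sort_values(["id", "calc_date_id"])

# A new run starts whenever the status or the id differs from the previous row.
new_run = (s["status"] != s["status"].shift()) | (s["id"] != s["id"].shift())
run_id = new_run.cumsum()

# The most recent run of each id is the one with the largest run label.
last_run = run_id.groupby(s["id"]).transform("max")

# Earliest date inside that most recent run, per id.
starts = s.loc[run_id == last_run].groupby("id")["calc_date_id"].min()

# Broadcast back to the original (unsorted) frame.
df["desired_output"] = df["id"].map(starts)
print(df)
```

For these two ids the result matches the output shown above: 2021-08-31 for id '1' and 2022-10-25 for id '2'.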

From your desired output, I see that the group "boundaries" are the points where a particular value of the status column occurs for the first time, regardless of the id column.

To mark the first occurrence of each value in the status column, run:

wrk = df.groupby('status', group_keys=False).apply(
    lambda grp: grp.assign(isFirst=grp.index[0] == grp.index))
wrk.isFirst = wrk.isFirst.cumsum()

To see the result, print wrk and look at the isFirst column.

Then, to generate the result, run:

result = wrk.groupby('isFirst', group_keys=False).apply(
    lambda grp: grp.assign(desired_output=grp.calc_date_id.min()))\
    .drop(columns='isFirst')

Note the final drop, which removes the now-unnecessary isFirst column.

The result, for your data sample, is:

   id          status calc_date_id desired_output
0   1          Active   2022-07-05     2021-08-31
1   1          Active   2022-06-07     2021-08-31
2   1          Active   2021-08-31     2021-08-31
3   1  Pending Action   2021-07-01     2021-07-01
4   2  Pending Action   2022-11-20     2021-07-01
5   2  Pending Action   2022-10-25     2021-07-01
6   2          Active   2022-04-02     2021-07-01
7   2  Pending Action   2022-02-28     2021-07-01
8   2          Active   2021-07-01     2021-07-01
9   3           Draft   2022-06-23     2020-03-19
10  3          Active   2022-06-15     2020-03-19
11  3           Draft   2022-04-07     2020-03-19
12  3           Draft   2022-11-09     2020-03-19
13  3           Draft   2020-08-18     2020-03-19
14  3          Active   2020-03-19     2020-03-19
15  3           Draft   2022-01-17     2020-03-19
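The same grouping idea can probably be written without apply. The sketch below assumes that "first occurrence of a status" can be detected with `Series.duplicated`, and it uses only the rows for ids '1' and '2'; the names `is_first` and `block` are illustrative and not part of the answer.

```python
import pandas as pd

# Rows for ids '1' and '2', copied from the dataset above.
df = pd.DataFrame({
    "id": ["1"] * 4 + ["2"] * 5,
    "status": ["Active", "Active", "Active", "Pending Action",
               "Pending Action", "Pending Action", "Active",
               "Pending Action", "Active"],
    "calc_date_id": ["05/07/2022", "07/06/2022", "31/08/2021", "01/07/2021",
                     "20/11/2022", "25/10/2022", "02/04/2022",
                     "28/02/2022", "01/07/2021"],
})
df["calc_date_id"] = pd.to_datetime(df["calc_date_id"], format="%d/%m/%Y")

# True on the row where each status value appears for the first time.
is_first = ~df["status"].duplicated()

# Running count of first occurrences: one block label per row.
block = is_first.cumsum().rename("block")

# Earliest date within each block, broadcast to every row of the block.
df["desired_output"] = df.groupby(block)["calc_date_id"].transform("min")
print(df)
```

On these rows it gives 2021-08-31 for rows 0 to 2 and 2021-07-01 for rows 3 to 8, in line with the output above.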
