Pandas groupby on two columns to get the earliest date

Question

```
import pandas as pd

data = {
    'id': ['1','1','1','1','2','2','2','2','2','3','3','3','3','3','3','3'],
    'status': ['Active','Active','Active','Pending Action','Pending Action','Pending Action','Active','Pending Action','Active','Draft','Active','Draft','Draft','Draft','Active','Draft'],
    'calc_date_id': ['05/07/2022','07/06/2022','31/08/2021','01/07/2021','20/11/2022','25/10/2022','02/04/2022','28/02/2022','01/07/2021','23/06/2022','15/06/2022','07/04/2022','09/11/2022','18/08/2020','19/03/2020','17/01/202'],
}

df = pd.DataFrame(data)
# Convert to datetime; the strings are day-first (dd/mm/yyyy)
df['calc_date_id'] = pd.to_datetime(df['calc_date_id'], dayfirst=True)
```

For each id, how can I get the first date of its most recent status change?

I tried sorting by date and then grouping by id and status with keep="first", but I got:

[Image: result grouped by status]

I also tried:

df_mt_date.loc[df_mt_date.groupby(['id', 'status'])['calc_date_id'].idxmin()]

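Instead, I want to keep the rows ordered by date and, for each id, get only the first date of its latest status (not the whole history).

This is the desired output

I have run out of ideas and would appreciate any suggestions.

Thanks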

Answer 1

Try:

df["desired_output"] = df.groupby("id")["status"].transform(
    lambda x: df.loc[x.index, "calc_date_id"][(x != x.shift(-1)).idxmax()]
)
print(df)

Prints:

   id          status calc_date_id desired_output
0   1          Active   2022-07-05     2021-08-31
1   1          Active   2022-06-07     2021-08-31
2   1          Active   2021-08-31     2021-08-31
3   1  Pending Action   2021-07-01     2021-08-31
4   2  Pending Action   2022-11-20     2022-10-25
5   2  Pending Action   2022-10-25     2022-10-25
6   2          Active   2022-04-02     2022-10-25
7   2  Pending Action   2022-02-28     2022-10-25
8   2          Active   2021-07-01     2022-10-25
9   3           Draft   2022-06-23     2022-06-23
10  3          Active   2022-06-15     2022-06-23
11  3           Draft   2022-04-07     2022-06-23
12  3           Draft   2022-11-09     2022-06-23
13  3           Draft   2020-08-18     2022-06-23
14  3          Active   2020-03-19     2022-06-23
15  3           Draft   2020-01-17     2022-06-23
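
The expression in Answer 1 appears to rely on the rows of each id already being ordered newest-first. As a rough sketch only, the same idea can be written so that it sorts explicitly. The helper name `first_date_of_latest_status` and the shortened sample data are assumptions made for illustration and are not part of the original answer.

```python
import pandas as pd

# Shortened, hypothetical sample (not the asker's full data set).
df = pd.DataFrame({
    "id": ["1", "1", "1", "1", "2", "2", "2"],
    "status": ["Active", "Active", "Active", "Pending Action",
               "Pending Action", "Active", "Pending Action"],
    "calc_date_id": ["05/07/2022", "07/06/2022", "31/08/2021", "01/07/2021",
                     "20/11/2022", "02/04/2022", "28/02/2022"],
})
df["calc_date_id"] = pd.to_datetime(df["calc_date_id"], dayfirst=True)


def first_date_of_latest_status(group: pd.DataFrame) -> pd.Timestamp:
    """Earliest date within the most recent run of identical statuses."""
    ordered = group.sort_values("calc_date_id")  # oldest -> newest
    # A new run starts whenever the status differs from the previous row.
    run_id = (ordered["status"] != ordered["status"].shift()).cumsum()
    latest_run = ordered[run_id == run_id.iloc[-1]]
    return latest_run["calc_date_id"].min()


# One date per id, then broadcast back onto every row of that id.
first_dates = {key: first_date_of_latest_status(g) for key, g in df.groupby("id")}
df["desired_output"] = df["id"].map(first_dates)
print(df)
```

Building a plain dictionary and mapping it back keeps the original row order untouched, which seems to be what the question asks for.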

Answer 2

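From your desired output I see that the group "boundaries" are the points where a particular value of the status column occurs for the first time, regardless of the id column.

To mark the first occurrence of each value in the status column, run: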

wrk = df.groupby('status', group_keys=False).apply(
    lambda grp: grp.assign(isFirst=grp.index[0] == grp.index))
wrk.isFirst = wrk.isFirst.cumsum()

To see the result, print wrk and look at the isFirst column.

Then, to generate the result, run:
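
To make the isFirst column easier to picture, here is a small sketch on a hypothetical status column. It uses `Series.duplicated` rather than the groupby from the answer, on the assumption that both flag the first row of each status value, so treat it as an illustration and not as the answerer's code.

```python
import pandas as pd

# Hypothetical status column, kept short so the block numbers are easy to follow.
status = pd.Series(
    ["Active", "Active", "Pending Action", "Pending Action",
     "Active", "Draft", "Active", "Draft"],
    name="status",
)

# True on the first row where each status value appears, False afterwards.
is_first = ~status.duplicated()

# The running sum turns those flags into block numbers: 1, 1, 2, 2, 2, 3, 3, 3.
block = is_first.cumsum()

print(pd.DataFrame({"status": status, "first_occurrence": is_first, "isFirst": block}))
```

A new block starts only when a status value is seen for the first time, so later repeats of "Active" stay in whichever block is current.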

result = wrk.groupby('isFirst', group_keys=False).apply(
    lambda grp: grp.assign(desired_output=grp.calc_date_id.min()))\
    .drop(columns='isFirst')

Note the final drop, which removes the now-unnecessary isFirst column.

For your data sample, the result is:

   id          status calc_date_id desired_output
0   1          Active   2022-07-05     2021-08-31
1   1          Active   2022-06-07     2021-08-31
2   1          Active   2021-08-31     2021-08-31
3   1  Pending Action   2021-07-01     2021-07-01
4   2  Pending Action   2022-11-20     2021-07-01
5   2  Pending Action   2022-10-25     2021-07-01
6   2          Active   2022-04-02     2021-07-01
7   2  Pending Action   2022-02-28     2021-07-01
8   2          Active   2021-07-01     2021-07-01
9   3           Draft   2022-06-23     2020-03-19
10  3          Active   2022-06-15     2020-03-19
11  3           Draft   2022-04-07     2020-03-19
12  3           Draft   2022-11-09     2020-03-19
13  3           Draft   2020-08-18     2020-03-19
14  3          Active   2020-03-19     2020-03-19
15  3           Draft   2022-01-17     2020-03-19
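
The second step of Answer 2 (the minimum date per block) could probably also be written with `transform` instead of `apply`, which avoids the temporary column. The sketch below assumes the block numbering described above and uses a hypothetical six-row sample. It is one possible rewrite, not the answerer's code.

```python
import pandas as pd

# Hypothetical six-row sample with day-first date strings.
df = pd.DataFrame({
    "status": ["Active", "Active", "Pending Action", "Active", "Draft", "Active"],
    "calc_date_id": pd.to_datetime(
        ["05/07/2022", "31/08/2021", "01/07/2021",
         "02/04/2022", "23/06/2022", "19/03/2020"],
        dayfirst=True,
    ),
})

# Block number increases each time a status value appears for the first time.
block = (~df["status"].duplicated()).cumsum().rename("block")

# transform("min") returns one value per original row, so no helper column
# has to be added and dropped afterwards.
df["desired_output"] = df.groupby(block)["calc_date_id"].transform("min")
print(df)
```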
