How to get previous row with condition in a DataFrame of Pandas
Each record (name) has a date and a status (Begin/Processing/Finished). How can I get the date of the Begin status for every row? Thank you.
date name status
0 2020-10-01 name_01 Begin
1 2020-10-02 name_02 Begin
2 2020-10-03 name_01 Processing
3 2020-10-04 name_03 Begin
4 2020-10-05 name_02 Processing
5 2020-10-06 name_01 Finished
6 2020-10-07 name_02 Finished
7 2020-10-08 name_03 Processing
8 2020-10-09 name_03 Finished
I need this:
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
Sorry, I didn't mention that a name can restart its status. For example, name_01 can appear with the "Begin" status again. See rows 9 and 10.
Like this:
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
9 2020-10-10 name_01 Begin 2020-10-10
10 2020-10-11 name_01 Processing 2020-10-10
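So it is not enough to find the single "Begin" row with the same name. I need the date of the most recent record with the same name and the "Begin" status.
Sorry for my poor English.
Sample data: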
date name status
0 2020-10-01 name_01 Begin
1 2020-10-02 name_02 Begin
2 2020-10-03 name_01 Processing
3 2020-10-05 name_02 Processing
4 2020-10-06 name_03 Begin
5 2020-10-07 name_01 Finished
6 2020-10-08 name_02 Finished
7 2020-10-09 name_03 Processing
8 2020-10-10 name_03 Finished
9 2020-10-11 name_01 Begin
10 2020-10-12 name_01 Processing
11 2020-10-13 name_02 Begin
12 2020-10-14 name_02 Processing
13 2020-10-15 name_02 Finished
14 2020-10-16 name_01 Finished
Expected result:
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-05 name_02 Processing 2020-10-02
4 2020-10-06 name_03 Begin 2020-10-06
5 2020-10-07 name_01 Finished 2020-10-01
6 2020-10-08 name_02 Finished 2020-10-02
7 2020-10-09 name_03 Processing 2020-10-06
8 2020-10-10 name_03 Finished 2020-10-06
9 2020-10-11 name_01 Begin 2020-10-11
10 2020-10-12 name_01 Processing 2020-10-11
11 2020-10-13 name_02 Begin 2020-10-13
12 2020-10-14 name_02 Processing 2020-10-13
13 2020-10-15 name_02 Finished 2020-10-13
14 2020-10-16 name_01 Finished 2020-10-11
I tried running this code:
df['begin_at'] = df.groupby('name').apply(lambda grp:
grp.groupby((grp.status == 'Begin').cumsum(), as_index=False)
.date.transform('first'))
But it gave:
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-11
1 2020-10-02 name_02 Begin 2020-10-13
2 2020-10-03 name_01 Processing 2020-10-11
3 2020-10-05 name_02 Processing 2020-10-13
4 2020-10-06 name_03 Begin NaT
5 2020-10-07 name_01 Finished 2020-10-11
6 2020-10-08 name_02 Finished 2020-10-13
7 2020-10-09 name_03 Processing NaT
8 2020-10-10 name_03 Finished NaT
9 2020-10-11 name_01 Begin NaT
10 2020-10-12 name_01 Processing NaT
11 2020-10-13 name_02 Begin NaT
12 2020-10-14 name_02 Processing NaT
13 2020-10-15 name_02 Finished NaT
14 2020-10-16 name_01 Finished NaT
Here is the whole code:
import numpy as np
import pandas as pd
df = pd.DataFrame([
["2020-10-01", "name_01", "Begin"],
["2020-10-02", "name_02", "Begin"],
["2020-10-03", "name_01", "Processing"],
["2020-10-05", "name_02", "Processing"],
["2020-10-06", "name_03", "Begin"],
["2020-10-07", "name_01", "Finished"],
["2020-10-08", "name_02", "Finished"],
["2020-10-09", "name_03", "Processing"],
["2020-10-10", "name_03", "Finished"],
["2020-10-11", "name_01", "Begin"],
["2020-10-12", "name_01", "Processing"],
["2020-10-13", "name_02", "Begin"],
["2020-10-14", "name_02", "Processing"],
["2020-10-15", "name_02", "Finished"],
["2020-10-16", "name_01", "Finished"],
], columns=["date", "name", "status"])
df['date'] = pd.to_datetime(df.date)
df = df.sort_values(by="date")
print(df)
df['begin_at'] = df.groupby('name').apply(lambda grp:
grp.groupby(
(grp.status == 'Begin').cumsum(), as_index=False)
.date.transform('first'))
print(df)
Taking advantage of the fact that Begin sorts alphabetically before Processing and Finished, use sort_values followed by groupby with transform('first'):
df['begin_at'] = df.sort_values('status').groupby('name').date.transform('first')
Out[719]:
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
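A minimal runnable sketch of this one-liner, assuming the single-cycle data from the first version of the question (exactly one Begin row per name). The DataFrame construction is only illustrative.

```python
import pandas as pd

# Single-cycle data: every name has exactly one Begin row.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04", "2020-10-05",
        "2020-10-06", "2020-10-07", "2020-10-08", "2020-10-09",
    ]),
    "name": ["name_01", "name_02", "name_01", "name_03", "name_02",
             "name_01", "name_02", "name_03", "name_03"],
    "status": ["Begin", "Begin", "Processing", "Begin", "Processing",
               "Finished", "Finished", "Processing", "Finished"],
})

# After sorting by status, the Begin row comes first inside every name group,
# so transform("first") broadcasts its date to all rows of that name.
# Assigning back to df aligns on the original index.
df["begin_at"] = df.sort_values("status").groupby("name")["date"].transform("first")
print(df)
```

Because transform('first') yields one value per name, this only fits data in which a name never restarts with a second Begin row.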
Assuming the Begin date is always <= the Processing or Finished date:
>>> df.assign(begin_at=df.groupby('name').date.transform('min'))
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
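A short sketch of where this assumption stops holding, using only the name_01 rows from the question's later sample data, in which name_01 begins a second cycle. The per-name minimum is always the date of the very first Begin, so the rows of the second cycle do not get the most recent Begin date.

```python
import pandas as pd

# name_01 goes through one full cycle and then starts a second one.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-03", "2020-10-07",
                            "2020-10-11", "2020-10-12"]),
    "name": ["name_01"] * 5,
    "status": ["Begin", "Processing", "Finished", "Begin", "Processing"],
})

# The minimum date per name ignores the second Begin on 2020-10-11.
df["begin_at"] = df.groupby("name")["date"].transform("min")
print(df)
```

Rows 3 and 4 receive 2020-10-01 here, whereas the question asks for 2020-10-11.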
Create an auxiliary Series:
begin_at = df[df.status == 'Begin'].set_index('name').date.rename('begin_at')
Then join it with your DataFrame:
result = df.join(begin_at, on='name')
The result is:
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
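Alternatively, if you no longer need the original DataFrame, save the result back to df.
Your post contains only a single cycle of Begin, Processing and Finished events for each name. If there are multiple such cycles (for at least one name), a different approach is needed: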
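A small sketch of what the join does when a name has more than one Begin row, using only the name_01 rows from the question's later sample data. The auxiliary Series then has a repeated index label, and the join pairs every row of that name with every Begin date.

```python
import pandas as pd

# name_01 has two Begin rows (two cycles).
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-03", "2020-10-07",
                            "2020-10-11", "2020-10-12"]),
    "name": ["name_01"] * 5,
    "status": ["Begin", "Processing", "Finished", "Begin", "Processing"],
})

# Auxiliary Series indexed by name; the label name_01 appears twice.
begin_at = df[df.status == "Begin"].set_index("name").date.rename("begin_at")

# Each of the 5 rows matches both Begin dates, so the result has 10 rows.
result = df.join(begin_at, on="name")
print(len(df), len(result))
```

The join therefore suits data with exactly one Begin row per name.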
df['begin_at'] = df.groupby('name').apply(lambda grp: grp.groupby(
(grp.status == 'Begin').cumsum()).date.transform('first'))\
.reset_index(level=0, drop=True)
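It consists of a two-level grouping: the first level groups by name, and the second level, within each name, groups by the cumulative sum of status == 'Begin', so every Begin row starts a new group.
Then, in each second-level group, the first date is broadcast to all member rows.
One more step is to drop the top level of the MultiIndex, which was added by the grouping. Initially I tried to avoid this extra index level by passing as_index=False, but apparently this arrangement sometimes fails.
The whole result is saved under a new column.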
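An alternative sketch that avoids apply and the extra index level: keep the date only on Begin rows and forward-fill it within each name. This is a different technique from the two-level grouping, not the answerer's code, and it assumes the rows are already sorted by date.

```python
import pandas as pd

rows = [
    ("2020-10-01", "name_01", "Begin"), ("2020-10-02", "name_02", "Begin"),
    ("2020-10-03", "name_01", "Processing"), ("2020-10-05", "name_02", "Processing"),
    ("2020-10-06", "name_03", "Begin"), ("2020-10-07", "name_01", "Finished"),
    ("2020-10-08", "name_02", "Finished"), ("2020-10-09", "name_03", "Processing"),
    ("2020-10-10", "name_03", "Finished"), ("2020-10-11", "name_01", "Begin"),
    ("2020-10-12", "name_01", "Processing"), ("2020-10-13", "name_02", "Begin"),
    ("2020-10-14", "name_02", "Processing"), ("2020-10-15", "name_02", "Finished"),
    ("2020-10-16", "name_01", "Finished"),
]
df = pd.DataFrame(rows, columns=["date", "name", "status"])
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# Keep the date on Begin rows only; every other row becomes NaT.
begin_dates = df["date"].where(df["status"].eq("Begin"))

# Forward-fill per name: each row inherits the latest Begin date of its name.
df["begin_at"] = begin_dates.groupby(df["name"]).ffill()
print(df)
```

A row that precedes the first Begin of its name would stay NaT, which may or may not be the desired behaviour.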
I found a shorter and simpler solution.
Create an auxiliary Series containing only the Begin dates:
begin_at = df[df.status == 'Begin'].set_index('name').date.rename('begin_at')
The result is:
name
name_01   2020-10-01
name_02   2020-10-02
name_03   2020-10-06
name_01   2020-10-11
name_02   2020-10-13
Name: begin_at, dtype: datetime64[ns]
Then merge (the "asof" version):
result = pd.merge_asof(df, begin_at, by='name', left_on='date', right_on='begin_at')
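This operation actually works in two steps: rows are first matched on name, and then, for each row of df, the most recent begin_at that is not later than date is picked.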
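The asof merge can be sketched end to end as follows. As an assumption for clarity, the right-hand side is built as a two-column DataFrame (the hypothetical name begins) rather than a Series indexed by name. Both inputs must be sorted by their date keys, which holds here because df is sorted by date.

```python
import pandas as pd

rows = [
    ("2020-10-01", "name_01", "Begin"), ("2020-10-02", "name_02", "Begin"),
    ("2020-10-03", "name_01", "Processing"), ("2020-10-05", "name_02", "Processing"),
    ("2020-10-06", "name_03", "Begin"), ("2020-10-07", "name_01", "Finished"),
    ("2020-10-08", "name_02", "Finished"), ("2020-10-09", "name_03", "Processing"),
    ("2020-10-10", "name_03", "Finished"), ("2020-10-11", "name_01", "Begin"),
    ("2020-10-12", "name_01", "Processing"), ("2020-10-13", "name_02", "Begin"),
    ("2020-10-14", "name_02", "Processing"), ("2020-10-15", "name_02", "Finished"),
    ("2020-10-16", "name_01", "Finished"),
]
df = pd.DataFrame(rows, columns=["date", "name", "status"])
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# Only the Begin rows, with the date column renamed to begin_at.
begins = (df.loc[df["status"].eq("Begin"), ["name", "date"]]
            .rename(columns={"date": "begin_at"}))

# For each row of df: same name, latest begin_at <= date (backward search).
result = pd.merge_asof(df, begins, by="name",
                       left_on="date", right_on="begin_at")
print(result)
```

The default direction of merge_asof is backward and exact matches are allowed, so a Begin row is matched with its own date.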
Use %timeit to check the execution time of each variant on a larger sample of source data. I suppose the last variant will run faster than my previous one.
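%timeit is an IPython magic; outside IPython, the standard timeit module can be used for a rough comparison. The sketch below is only illustrative: make_data, two_level and asof are hypothetical helpers, the data is synthetic, and two_level is a transform-based rewrite of the two-level grouping rather than the exact apply-based code. No timing figures are claimed.

```python
import timeit
import pandas as pd

def make_data(n_names=200, n_cycles=3):
    """Synthetic data: every name runs n_cycles Begin/Processing/Finished cycles."""
    statuses = ["Begin", "Processing", "Finished"]
    rows = [(f"name_{i:04d}", s)
            for _ in range(n_cycles) for s in statuses for i in range(n_names)]
    out = pd.DataFrame(rows, columns=["name", "status"])
    # One timestamp per row, strictly increasing, so the frame is sorted by date.
    out.insert(0, "date", pd.date_range("2020-01-01", periods=len(out), freq="min"))
    return out

def two_level(df):
    # Cycle number within each name: increases by one at every Begin row.
    cycle = df["status"].eq("Begin").groupby(df["name"]).cumsum()
    return df.groupby(["name", cycle])["date"].transform("first")

def asof(df):
    begins = (df.loc[df["status"].eq("Begin"), ["name", "date"]]
                .rename(columns={"date": "begin_at"}))
    merged = pd.merge_asof(df, begins, by="name",
                           left_on="date", right_on="begin_at")
    return merged["begin_at"]

big = make_data()
# Both variants should agree before their timings are compared.
same = two_level(big).tolist() == asof(big).tolist()
print("results agree:", same)
print("two-level:", timeit.timeit(lambda: two_level(big), number=5))
print("asof:     ", timeit.timeit(lambda: asof(big), number=5))
```

Timings depend on the pandas version and on the shape of the data, so it is worth measuring on data that resembles the real workload.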