How to get previous row with condition in a DataFrame of Pandas

Each record (name) has a date and a status (Begin / Processing / Finished). How can I get the date of the Begin status for each row? Thank you.

         date     name      status
0  2020-10-01  name_01       Begin
1  2020-10-02  name_02       Begin
2  2020-10-03  name_01  Processing
3  2020-10-04  name_03       Begin
4  2020-10-05  name_02  Processing
5  2020-10-06  name_01    Finished
6  2020-10-07  name_02    Finished
7  2020-10-08  name_03  Processing
8  2020-10-09  name_03    Finished

I need this:

         date     name      status  begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04

Edited

Sorry, I did not mention that a name can restart its status cycle. For example, name_01 can get the "Begin" status again. See rows 9 and 10.

Like this:

         date     name      status  begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04
9  2020-10-10  name_01       Begin  2020-10-10
10 2020-10-11  name_01  Processing  2020-10-10

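So it is not enough to find the only "Begin" row with the same name. For each row, I need the date of the most recent record with status "Begin" for the same name.

Sorry for my poor English.


Update:

Sample data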

         date     name      status
0  2020-10-01  name_01       Begin
1  2020-10-02  name_02       Begin
2  2020-10-03  name_01  Processing
3  2020-10-05  name_02  Processing
4  2020-10-06  name_03       Begin
5  2020-10-07  name_01    Finished
6  2020-10-08  name_02    Finished
7  2020-10-09  name_03  Processing
8  2020-10-10  name_03    Finished
9  2020-10-11  name_01       Begin
10 2020-10-12  name_01  Processing
11 2020-10-13  name_02       Begin
12 2020-10-14  name_02  Processing
13 2020-10-15  name_02    Finished
14 2020-10-16  name_01    Finished

Expected result

         date     name      status  begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-05  name_02  Processing  2020-10-02
4  2020-10-06  name_03       Begin  2020-10-06
5  2020-10-07  name_01    Finished  2020-10-01
6  2020-10-08  name_02    Finished  2020-10-02
7  2020-10-09  name_03  Processing  2020-10-06
8  2020-10-10  name_03    Finished  2020-10-06
9  2020-10-11  name_01       Begin  2020-10-11
10 2020-10-12  name_01  Processing  2020-10-11
11 2020-10-13  name_02       Begin  2020-10-13
12 2020-10-14  name_02  Processing  2020-10-13
13 2020-10-15  name_02    Finished  2020-10-13
14 2020-10-16  name_01    Finished  2020-10-11

I tried running this code

df['begin_at'] = df.groupby('name').apply(lambda grp:
    grp.groupby((grp.status == 'Begin').cumsum(), as_index=False)
    .date.transform('first'))

but it gave

         date     name      status   begin_at
0  2020-10-01  name_01       Begin 2020-10-11
1  2020-10-02  name_02       Begin 2020-10-13
2  2020-10-03  name_01  Processing 2020-10-11
3  2020-10-05  name_02  Processing 2020-10-13
4  2020-10-06  name_03       Begin        NaT
5  2020-10-07  name_01    Finished 2020-10-11
6  2020-10-08  name_02    Finished 2020-10-13
7  2020-10-09  name_03  Processing        NaT
8  2020-10-10  name_03    Finished        NaT
9  2020-10-11  name_01       Begin        NaT
10 2020-10-12  name_01  Processing        NaT
11 2020-10-13  name_02       Begin        NaT
12 2020-10-14  name_02  Processing        NaT
13 2020-10-15  name_02    Finished        NaT
14 2020-10-16  name_01    Finished        NaT

Here is the whole code

import numpy as np
import pandas as pd
df = pd.DataFrame([
    ["2020-10-01", "name_01", "Begin"],
    ["2020-10-02", "name_02", "Begin"],
    ["2020-10-03", "name_01", "Processing"],
    ["2020-10-05", "name_02", "Processing"],
    ["2020-10-06", "name_03", "Begin"],
    ["2020-10-07", "name_01", "Finished"],
    ["2020-10-08", "name_02", "Finished"],
    ["2020-10-09", "name_03", "Processing"],
    ["2020-10-10", "name_03", "Finished"],
    ["2020-10-11", "name_01", "Begin"],
    ["2020-10-12", "name_01", "Processing"],
    ["2020-10-13", "name_02", "Begin"],
    ["2020-10-14", "name_02", "Processing"],
    ["2020-10-15", "name_02", "Finished"],
    ["2020-10-16", "name_01", "Finished"],
], columns=["date", "name", "status"])
df['date'] = pd.to_datetime(df.date)
df = df.sort_values(by="date")

print(df)

df['begin_at'] = df.groupby('name').apply(lambda grp:
                                          grp.groupby(
                                              (grp.status == 'Begin').cumsum(), as_index=False)
                                          .date.transform('first'))
print(df)

Take advantage of the alphabetical order of the status values (Begin sorts before Finished and Processing), using sort_values and groupby transform('first'):

df['begin_at'] = df.sort_values('status').groupby('name').date.transform('first')

Out[719]:
         date     name      status    begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04

Assuming the Begin date is always <= the Processing and Finished dates:

>>> df.assign(begin_at=df.groupby('name').date.transform('min'))
         date     name      status    begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04
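
The assumption above also implies a single cycle per name. As a small check (a sketch only, using the name_01 rows from the edited question, where the cycle restarts on 2020-10-10), the group minimum still returns the first Begin date for every row:

```python
import pandas as pd

# name_01 rows from the edited question; the cycle restarts on 2020-10-10.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-03", "2020-10-06",
                            "2020-10-10", "2020-10-11"]),
    "name": ["name_01"] * 5,
    "status": ["Begin", "Processing", "Finished", "Begin", "Processing"],
})

# Group minimum: one date per name, no matter how many cycles the name has.
by_min = df.groupby("name")["date"].transform("min")
print(df.assign(begin_at=by_min))
```

The last two rows get 2020-10-01 rather than 2020-10-10, so this shortcut only fits data where no name ever restarts.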

Create an auxiliary Series:

begin_at = df[df.status == 'Begin'].set_index('name').date.rename('begin_at')

Then join it to your DataFrame:

result = df.join(begin_at, on='name')

The result is:

         date     name      status    begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04

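Or, if you no longer need the original DataFrame, save the result back under df.

Edit

Your post contained only a single cycle of Begin / Processing / Finished events for each name. But if there are multiple such cycles (for at least one name), a different approach is needed: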

df['begin_at'] = df.groupby('name').apply(lambda grp: grp.groupby(
    (grp.status == 'Begin').cumsum()).date.transform('first'))\
    .reset_index(level=0, drop=True)

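It consists of a two-level grouping.

  • First level: by name.
  • Second level: each "group" of rows that starts with a Begin status.

Then, within each second-level group, the first date is generated for all member rows.

One more step is to drop the top level of the MultiIndex, which was added by the grouping. Initially I tried to avoid this extra index level by passing as_index=False, but apparently this arrangement sometimes fails.

The whole result is saved under a new column.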
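
The same two-level idea can also be sketched without apply, so no extra index level appears. This is an alternative arrangement, not the code from this answer: the helper key name `cycle` is made up for illustration, and the rows are assumed to be sorted by date.

```python
import pandas as pd

rows = [
    ("2020-10-01", "name_01", "Begin"), ("2020-10-02", "name_02", "Begin"),
    ("2020-10-03", "name_01", "Processing"), ("2020-10-05", "name_02", "Processing"),
    ("2020-10-06", "name_03", "Begin"), ("2020-10-07", "name_01", "Finished"),
    ("2020-10-08", "name_02", "Finished"), ("2020-10-09", "name_03", "Processing"),
    ("2020-10-10", "name_03", "Finished"), ("2020-10-11", "name_01", "Begin"),
    ("2020-10-12", "name_01", "Processing"), ("2020-10-13", "name_02", "Begin"),
    ("2020-10-14", "name_02", "Processing"), ("2020-10-15", "name_02", "Finished"),
    ("2020-10-16", "name_01", "Finished"),
]
df = pd.DataFrame(rows, columns=["date", "name", "status"])
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")  # the cycle counter below relies on date order

# Second-level key: running count of "Begin" rows within each name.
cycle = ((df["status"] == "Begin").astype(int)
         .groupby(df["name"]).cumsum().rename("cycle"))

# Group by (name, cycle) and broadcast the first date of each cycle.
df["begin_at"] = df.groupby(["name", cycle])["date"].transform("first")
print(df)
```

Because transform returns a Series aligned with the original index, the result can be assigned directly as a new column.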

Edit 2

I found a shorter and simpler solution.

  1. Create an auxiliary Series containing only the Begin dates:

     begin_at = df[df.status == 'Begin'].set_index('name').date.rename('begin_at')

    The result is:

     name
     name_01   2020-10-01
     name_02   2020-10-02
     name_03   2020-10-06
     name_01   2020-10-11
     name_02   2020-10-13
     Name: begin_at, dtype: datetime64[ns]

  2. Then merge (the "asof" version):

     result = pd.merge_asof(df, begin_at, by='name', left_on='date', right_on='begin_at')

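    This operation is actually performed in two steps:

    • First, rows of df and elements of begin_at are matched on name.
    • Then the actual merge is performed in the default (backward) direction: for each row of df, the equal or nearest earlier date is looked up in begin_at, among the elements of the "current group", i.e. those whose name (index) matches.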
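
A self-contained sketch of the backward "asof" match described above, on a shortened sample. Here the right-hand side is built as a small DataFrame (named `begins` for illustration) rather than a Series, which is an assumption made to keep the key columns explicit:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-02", "2020-10-03",
                            "2020-10-05", "2020-10-11", "2020-10-12"]),
    "name": ["name_01", "name_02", "name_01", "name_02", "name_01", "name_01"],
    "status": ["Begin", "Begin", "Processing", "Processing", "Begin", "Processing"],
})

# Right-hand table: one row per "Begin" event.
begins = (df.loc[df["status"] == "Begin", ["name", "date"]]
            .rename(columns={"date": "begin_at"}))

# merge_asof needs both inputs sorted by their "on" keys (date / begin_at).
# Within each name, every row picks the latest begin_at that is <= its date.
result = pd.merge_asof(df.sort_values("date"), begins.sort_values("begin_at"),
                       left_on="date", right_on="begin_at", by="name")
print(result)
```

Note that merge_asof returns a frame with a fresh integer index, so the original index is not preserved.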

Use %timeit to check the execution time of each variant on a larger sample of source data. I suppose the last variant will run faster than my previous ones.
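
Outside IPython, such a comparison could be sketched with the standard timeit module. Everything below is illustrative: the data sizes are arbitrary, the function names `two_level` and `asof` are made up, and `two_level` is an apply-free arrangement of the two-level grouping rather than the exact code above. No timings are claimed here, since they depend on the data and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (arbitrary sizes): 200 names, 50 cycles of 3 statuses each.
n_names, n_cycles = 200, 50
events = n_cycles * 3
df = pd.DataFrame({
    "date": pd.Timestamp("2020-01-01")
            + pd.to_timedelta(np.tile(np.arange(events), n_names), unit="D"),
    "name": np.repeat([f"name_{i:03d}" for i in range(n_names)], events),
    "status": np.tile(["Begin", "Processing", "Finished"], n_names * n_cycles),
})
df = df.sort_values("date", kind="stable").reset_index(drop=True)

def two_level(df):
    # Group by name and by a per-name running count of "Begin" rows.
    cycle = ((df["status"] == "Begin").astype(int)
             .groupby(df["name"]).cumsum().rename("cycle"))
    return df.groupby(["name", cycle])["date"].transform("first")

def asof(df):
    # Backward asof merge against the table of "Begin" events.
    begins = (df.loc[df["status"] == "Begin", ["name", "date"]]
                .rename(columns={"date": "begin_at"}))
    merged = pd.merge_asof(df, begins, left_on="date",
                           right_on="begin_at", by="name")
    return merged["begin_at"]

for fn in (two_level, asof):
    seconds = timeit.timeit(lambda: fn(df), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per run")
```

Before trusting any timing, it is worth checking that both variants return the same values on the test data.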
