
How to get previous row with condition in a DataFrame of Pandas

Each record (name) has a date and a status (Begin/Processing/Finished). How can I get the date of the Begin status for each row? Thank you.

         date     name      status
0  2020-10-01  name_01       Begin
1  2020-10-02  name_02       Begin
2  2020-10-03  name_01  Processing
3  2020-10-04  name_03       Begin
4  2020-10-05  name_02  Processing
5  2020-10-06  name_01    Finished
6  2020-10-07  name_02    Finished
7  2020-10-08  name_03  Processing
8  2020-10-09  name_03    Finished

I need this:

         date     name      status  begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04

Edited

Sorry, I did not mention that a name can restart its status. For example, name_01 can appear with "Begin" status again; see rows 9 and 10 below.

Like this

         date     name      status  begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04
9  2020-10-10  name_01       Begin  2020-10-10
10 2020-10-11  name_01  Processing  2020-10-10

Therefore, it is not enough to find a single "Begin" row per name. The date of the latest "Begin" record for the same name has to be found.

Sorry for my poor English expression.
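In other words, each row needs the date of the most recent "Begin" row, at or before it, for the same name. That requirement can be sketched with a masked forward-fill per group (a small made-up sample, not the full data):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-02", "2020-10-03", "2020-10-10"]),
    "name": ["name_01", "name_02", "name_01", "name_01"],
    "status": ["Begin", "Begin", "Processing", "Begin"],
})

# Keep the date only on "Begin" rows, then forward-fill within each name,
# so every row inherits the latest earlier Begin date of the same name.
df["begin_at"] = df["date"].where(df["status"] == "Begin").groupby(df["name"]).ffill()

print(df["begin_at"].dt.strftime("%Y-%m-%d").tolist())
# ['2020-10-01', '2020-10-02', '2020-10-01', '2020-10-10']
```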


Update:

Sample Data

         date     name      status
0  2020-10-01  name_01       Begin
1  2020-10-02  name_02       Begin
2  2020-10-03  name_01  Processing
3  2020-10-05  name_02  Processing
4  2020-10-06  name_03       Begin
5  2020-10-07  name_01    Finished
6  2020-10-08  name_02    Finished
7  2020-10-09  name_03  Processing
8  2020-10-10  name_03    Finished
9  2020-10-11  name_01       Begin
10 2020-10-12  name_01  Processing
11 2020-10-13  name_02       Begin
12 2020-10-14  name_02  Processing
13 2020-10-15  name_02    Finished
14 2020-10-16  name_01    Finished

Expected result

         date     name      status  begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-05  name_02  Processing  2020-10-02
4  2020-10-06  name_03       Begin  2020-10-06
5  2020-10-07  name_01    Finished  2020-10-01
6  2020-10-08  name_02    Finished  2020-10-02
7  2020-10-09  name_03  Processing  2020-10-06
8  2020-10-10  name_03    Finished  2020-10-06
9  2020-10-11  name_01       Begin  2020-10-11
10 2020-10-12  name_01  Processing  2020-10-11
11 2020-10-13  name_02       Begin  2020-10-13
12 2020-10-14  name_02  Processing  2020-10-13
13 2020-10-15  name_02    Finished  2020-10-13
14 2020-10-16  name_01    Finished  2020-10-11

I tried to run the code

df['begin_at'] = df.groupby('name').apply(lambda grp:
    grp.groupby((grp.status == 'Begin').cumsum(), as_index=False)
    .date.transform('first'))

but it gave

         date     name      status   begin_at
0  2020-10-01  name_01       Begin 2020-10-11
1  2020-10-02  name_02       Begin 2020-10-13
2  2020-10-03  name_01  Processing 2020-10-11
3  2020-10-05  name_02  Processing 2020-10-13
4  2020-10-06  name_03       Begin        NaT
5  2020-10-07  name_01    Finished 2020-10-11
6  2020-10-08  name_02    Finished 2020-10-13
7  2020-10-09  name_03  Processing        NaT
8  2020-10-10  name_03    Finished        NaT
9  2020-10-11  name_01       Begin        NaT
10 2020-10-12  name_01  Processing        NaT
11 2020-10-13  name_02       Begin        NaT
12 2020-10-14  name_02  Processing        NaT
13 2020-10-15  name_02    Finished        NaT
14 2020-10-16  name_01    Finished        NaT

Here is the whole code

import numpy as np
import pandas as pd
df = pd.DataFrame([
    ["2020-10-01", "name_01", "Begin"],
    ["2020-10-02", "name_02", "Begin"],
    ["2020-10-03", "name_01", "Processing"],
    ["2020-10-05", "name_02", "Processing"],
    ["2020-10-06", "name_03", "Begin"],
    ["2020-10-07", "name_01", "Finished"],
    ["2020-10-08", "name_02", "Finished"],
    ["2020-10-09", "name_03", "Processing"],
    ["2020-10-10", "name_03", "Finished"],
    ["2020-10-11", "name_01", "Begin"],
    ["2020-10-12", "name_01", "Processing"],
    ["2020-10-13", "name_02", "Begin"],
    ["2020-10-14", "name_02", "Processing"],
    ["2020-10-15", "name_02", "Finished"],
    ["2020-10-16", "name_01", "Finished"],
], columns=["date", "name", "status"])
df['date'] = pd.to_datetime(df.date)
df = df.sort_values(by="date")

print(df)

df['begin_at'] = df.groupby('name').apply(lambda grp:
    grp.groupby((grp.status == 'Begin').cumsum(), as_index=False)
    .date.transform('first'))
print(df)

Take advantage of the alphabetical order of the statuses: 'Begin' sorts before both 'Finished' and 'Processing'. Use sort_values and a groupby transform 'first':

df['begin_at'] = df.sort_values('status').groupby('name').date.transform('first')

Out[719]:
         date     name      status    begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04
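This works only because of how the three labels happen to sort: 'Begin' comes first lexicographically, so after sort_values('status') the transform 'first' picks the Begin row's date for each name. A quick check of the assumed ordering:

```python
# 'Begin' < 'Finished' < 'Processing' in lexicographic order,
# so the Begin row ends up first after sorting by status.
print(sorted(["Processing", "Finished", "Begin"]))
# ['Begin', 'Finished', 'Processing']
```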

Assuming the Begin date is always <= the Processing or Finished dates:

>>> df.assign(begin_at=df.groupby('name').date.transform(min))
         date     name      status    begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04

Create an auxiliary Series :

begin_at = df[df.status == 'Begin'].set_index('name').date.rename('begin_at')

Then join your DataFrame with it:

result = df.join(begin_at, on='name')

The result is:

         date     name      status    begin_at
0  2020-10-01  name_01       Begin  2020-10-01
1  2020-10-02  name_02       Begin  2020-10-02
2  2020-10-03  name_01  Processing  2020-10-01
3  2020-10-04  name_03       Begin  2020-10-04
4  2020-10-05  name_02  Processing  2020-10-02
5  2020-10-06  name_01    Finished  2020-10-01
6  2020-10-07  name_02    Finished  2020-10-02
7  2020-10-08  name_03  Processing  2020-10-04
8  2020-10-09  name_03    Finished  2020-10-04

Or, if you don't need the original DataFrame any more, save the result back under df .

Edit

Your post contained only a single cycle of Begin , Processing and Finished events for the same name . But if there are multiple such cycles (at least for one name ), a different approach is needed:

df['begin_at'] = df.groupby('name').apply(lambda grp: grp.groupby(
    (grp.status == 'Begin').cumsum()).date.transform('first'))\
    .reset_index(level=0, drop=True)

It performs a two-level grouping:

  • first level - by name ,
  • second level - by each "group" starting from Begin status.

Then, within each second-level group, the first date is broadcast to all member rows.
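The second-level key is just a running count of Begin rows, so every row between one "Begin" and the next shares the same label. A tiny illustration of the cumsum trick, using hypothetical statuses for one name:

```python
import pandas as pd

status = pd.Series(["Begin", "Processing", "Finished", "Begin", "Processing"])

# Each True at a "Begin" row bumps the counter,
# labelling each Begin-to-next-Begin cycle with its own number:
print((status == "Begin").cumsum().tolist())
# [1, 1, 1, 2, 2]
```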

An additional step drops the top level of the MultiIndex added by the grouping. Initially I attempted to avoid this extra index level by passing as_index=False , but apparently that arrangement sometimes fails.

And the whole result is saved under the new column.

Edit 2

I have found a shorter and simpler solution.

  1. Create an auxiliary Series with begin dates only:

     begin_at = df[df.status == 'Begin'].set_index('name').date.rename('begin_at')

    The result is:

     name
     name_01   2020-10-01
     name_02   2020-10-02
     name_03   2020-10-06
     name_01   2020-10-11
     name_02   2020-10-13
     Name: begin_at, dtype: datetime64[ns]
  2. Then merge ("asof" version):

     result = pd.merge_asof(df, begin_at.reset_index(), by='name', left_on='date', right_on='begin_at')

    This operation actually breaks down into 2 steps:

    • First, rows of df and elements of begin_at are matched by name .
    • Then the actual merge is performed in the default ( backward ) direction: for each row of df , the equal or nearest earlier date is looked up in begin_at , among the entries with the matching value of name .
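Put together, a self-contained sketch of the merge_asof variant on a small made-up sample (note that the right side must be a DataFrame sorted by the merge key, hence the reset_index()):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-03", "2020-10-11", "2020-10-12"]),
    "name": ["name_01"] * 4,
    "status": ["Begin", "Processing", "Begin", "Processing"],
})

begin_at = df[df.status == "Begin"].set_index("name").date.rename("begin_at")

# merge_asof matches each row to the nearest earlier-or-equal Begin date of
# the same name; reset_index() turns the Series into a DataFrame so that
# by="name" can find a 'name' column on the right side.
result = pd.merge_asof(df, begin_at.reset_index(),
                       by="name", left_on="date", right_on="begin_at")

print(result["begin_at"].dt.strftime("%Y-%m-%d").tolist())
# ['2020-10-01', '2020-10-01', '2020-10-11', '2020-10-11']
```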

Using %timeit , check the execution time of each variant on a bigger source data sample. I suppose that the last variant will run faster than my earlier ones.
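Outside IPython, the standard timeit module can serve the same purpose. A rough harness on a randomly generated sample (the sizes and the number of distinct names are arbitrary assumptions):

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Random sample: monotonically increasing dates, 50 names, 3 statuses.
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n, freq="min"),
    "name": rng.choice([f"name_{i:02d}" for i in range(50)], size=n),
    "status": rng.choice(["Begin", "Processing", "Finished"], size=n),
})

def two_level_groupby():
    return df.groupby("name").apply(lambda grp: grp.groupby(
        (grp.status == "Begin").cumsum()).date.transform("first")
    ).reset_index(level=0, drop=True)

def asof_merge():
    begin = df[df.status == "Begin"].set_index("name").date.rename("begin_at")
    return pd.merge_asof(df, begin.reset_index(),
                         by="name", left_on="date", right_on="begin_at")

for fn in (two_level_groupby, asof_merge):
    print(fn.__name__, timeit.timeit(fn, number=3))
```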
