
Getting the days since the last occurrence of a value within a group

I have the following dataframe:

customer_id  start_date  end_date    incident
1            2022-01-01  2022-01-03  False
1            2022-01-02  2022-01-04  True
1            2022-01-04  2022-01-06  False
1            2022-01-05  2022-01-08  False
1            2022-01-05  2022-01-06  False

I now want to know, for each customer and row, how many days ago the last incident occurred. To be precise: for each row I want the days from its start_date until the end_date of the last row with incident == True.

This would be the desired output.

customer_id  start_date  end_date    incident  days_since_last_incident
1            2022-01-01  2022-01-03  False     NaN
1            2022-01-02  2022-01-04  True      NaN
1            2022-01-04  2022-01-06  False     0
1            2022-01-05  2022-01-08  False     1
1            2022-01-05  2022-01-06  False     1

Is there an elegant solution to this?

So far, I tried grouping by customer and then applying a function to each row, but that threw an out-of-bounds error for rows that had no previous incident. Here is my attempt so far; it only works for rows with a previous incident:

def days_since_last_incident(group):
    group["days_since_last_incidents"] = group.apply(
        lambda row: (
            row["start_date"]
            - (
                group[
                    (group["incident"] == True)
                    & (group["end_date"] <= row["start_date"])
                ]["end_date"].values
            )
        ).days,
        axis=1,
    )


df.groupby("customer_id").apply(days_since_last_incident)
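For reference, the row-wise apply approach can be made to work by guarding the case where no earlier incident exists and taking the most recent matching end_date. A minimal sketch with the sample data hard-coded (the helper name per_row is my own):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04",
                                  "2022-01-05", "2022-01-05"]),
    "end_date": pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-06",
                                "2022-01-08", "2022-01-06"]),
    "incident": [False, True, False, False, False],
})

def days_since_last_incident(group):
    def per_row(row):
        # end_dates of previous incidents that finished on or before this row's start
        prev = group.loc[group["incident"] & (group["end_date"] <= row["start_date"]),
                         "end_date"]
        if prev.empty:  # no earlier incident -> avoid indexing into an empty selection
            return float("nan")
        return (row["start_date"] - prev.max()).days

    group["days_since_last_incident"] = group.apply(per_row, axis=1)
    return group

out = df.groupby("customer_id", group_keys=False).apply(days_since_last_incident)
```

This keeps the original per-row logic but is O(n²) per group; the vectorized answer below it scales better.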

Try (explanation below):

# Not needed if start_date and end_date are already datetime64 columns
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

def days_from_last_incident(df):
    return (df['end_date'].where(df['incident']).ffill()
                          .rsub(df['start_date'])
                          .mask(df['incident'])
                          .dt.days.to_frame())
    
df['days'] = df.groupby(['customer_id'], group_keys=False).apply(days_from_last_incident)

Output:

>>> out

   customer_id start_date   end_date  incident  days
0            1 2022-01-01 2022-01-03     False   NaN
1            1 2022-01-02 2022-01-04      True   NaN
2            1 2022-01-04 2022-01-06     False   0.0
3            1 2022-01-05 2022-01-08     False   1.0
4            1 2022-01-05 2022-01-06     False   1.0

Step by step for the first group (customer_id=1)

# Step 1: keep end_date only where incident is True, then forward-fill it
>>> out = df['end_date'].where(df['incident']).ffill()
0          NaT
1   2022-01-04
2   2022-01-04
3   2022-01-04
4   2022-01-04

# Step 2: (right) subtract start_date
>>> out = out.rsub(df['start_date'])
0       NaT
1   -2 days
2    0 days
3    1 days
4    1 days

# Step 3: remove the delta for all rows where incident is True
>>> out = out.mask(df['incident'])
0      NaT
1      NaT
2   0 days
3   1 days
4   1 days
dtype: timedelta64[ns]

# Step 4: keep the day part of timedelta
>>> out = out.dt.days.to_frame()
     0
0  NaN
1  NaN
2  0.0
3  1.0
4  1.0
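If you'd rather avoid groupby.apply entirely, the same lookup can be sketched with pd.merge_asof, matching each start_date to the latest incident end_date at or before it, per customer (the column name last_incident_end is my own):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04",
                                  "2022-01-05", "2022-01-05"]),
    "end_date": pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-06",
                                "2022-01-08", "2022-01-06"]),
    "incident": [False, True, False, False, False],
})

# One row per incident: the end_date that later rows should measure from
incidents = (df.loc[df["incident"], ["customer_id", "end_date"]]
               .rename(columns={"end_date": "last_incident_end"})
               .sort_values("last_incident_end"))

# For each row, find the latest incident end_date <= start_date within the customer
out = pd.merge_asof(
    df.sort_values("start_date"),
    incidents,
    left_on="start_date",
    right_on="last_incident_end",
    by="customer_id",
    direction="backward",
)
out["days"] = (out["start_date"] - out["last_incident_end"]).dt.days.mask(out["incident"])
```

merge_asof requires both frames to be sorted on their respective date keys, which the sort_values calls above ensure.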
