
Getting the days since the last occurrence of a value within a group

I have the following dataframe:

customer_id  start_date  end_date    incident
1            2022-01-01  2022-01-03  False
1            2022-01-02  2022-01-04  True
1            2022-01-04  2022-01-06  False
1            2022-01-05  2022-01-08  False
1            2022-01-05  2022-01-06  False

I now want to know, for each customer and row, how many days ago the last incident occurred. To be precise: for each row I want the days from its start_date until the end_date of the last row with incident == True.

This would be the desired output.

customer_id  start_date  end_date    incident  days_since_last_incident
1            2022-01-01  2022-01-03  False     NaN
1            2022-01-02  2022-01-04  True      NaN
1            2022-01-04  2022-01-06  False     0
1            2022-01-05  2022-01-08  False     1
1            2022-01-05  2022-01-06  False     1

Is there an elegant solution to this?

So far, I tried grouping by customer and then applying a function to each row, but that threw an out-of-bounds error for rows that had no previous incident. Here is my attempt so far; it only works for rows with a previous incident:

def days_since_last_incident(group):
    group["days_since_last_incidents"] = group.apply(
        lambda row: (
            row["start_date"]
            - (
                group[
                    (group["incident"] == True)
                    & (group["end_date"] <= row["start_date"])
                ]["end_date"].values
            )
        ).days,
        axis=1,
    )


df.groupby("customer_id").apply(days_since_last_incident)
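For reference, the row-wise apply approach can be made to work by guarding the case where no earlier incident exists and taking the most recent matching end_date. A minimal sketch with the sample data hard-coded (the helper name per_row is my own):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04",
                                  "2022-01-05", "2022-01-05"]),
    "end_date": pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-06",
                                "2022-01-08", "2022-01-06"]),
    "incident": [False, True, False, False, False],
})

def days_since_last_incident(group):
    def per_row(row):
        # end_dates of previous incidents that finished on or before this row's start
        prev = group.loc[group["incident"] & (group["end_date"] <= row["start_date"]),
                         "end_date"]
        if prev.empty:  # no earlier incident -> avoid indexing into an empty selection
            return float("nan")
        return (row["start_date"] - prev.max()).days

    group["days_since_last_incident"] = group.apply(per_row, axis=1)
    return group

out = df.groupby("customer_id", group_keys=False).apply(days_since_last_incident)
```

This keeps the original per-row logic but is O(n²) per group; the vectorized answer below it scales better.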

Try (explanation below):

# Not needed if start_date and end_date are already datetime64 columns
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

def days_from_last_incident(df):
    return (df['end_date'].where(df['incident']).ffill()
                          .rsub(df['start_date'])
                          .mask(df['incident'])
                          .dt.days.to_frame())
    
df['days'] = df.groupby(['customer_id'], group_keys=False).apply(days_from_last_incident)

Output:

>>> out

   customer_id start_date   end_date  incident  days
0            1 2022-01-01 2022-01-03     False   NaN
1            1 2022-01-02 2022-01-04      True   NaN
2            1 2022-01-04 2022-01-06     False   0.0
3            1 2022-01-05 2022-01-08     False   1.0
4            1 2022-01-05 2022-01-06     False   1.0

Step by step for the first group (customer_id=1)

# Step 1: keep end_date only where incident is True, then forward-fill it
>>> out = df['end_date'].where(df['incident']).ffill()
0          NaT
1   2022-01-04
2   2022-01-04
3   2022-01-04
4   2022-01-04

# Step 2: (right) subtract start_date
>>> out = out.rsub(df['start_date'])
0       NaT
1   -2 days
2    0 days
3    1 days
4    1 days

# Step 3: remove the delta for all rows where incident is True
>>> out = out.mask(df['incident'])
0      NaT
1      NaT
2   0 days
3   1 days
4   1 days
dtype: timedelta64[ns]

# Step 4: keep the day part of timedelta
>>> out = out.dt.days.to_frame()
     0
0  NaN
1  NaN
2  0.0
3  1.0
4  1.0
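If you'd rather avoid groupby.apply entirely, the same lookup can be sketched with pd.merge_asof, matching each start_date to the latest incident end_date at or before it, per customer (the column name last_incident_end is my own):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04",
                                  "2022-01-05", "2022-01-05"]),
    "end_date": pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-06",
                                "2022-01-08", "2022-01-06"]),
    "incident": [False, True, False, False, False],
})

# One row per incident: the end_date that later rows should measure from
incidents = (df.loc[df["incident"], ["customer_id", "end_date"]]
               .rename(columns={"end_date": "last_incident_end"})
               .sort_values("last_incident_end"))

# For each row, find the latest incident end_date <= start_date within the customer
out = pd.merge_asof(
    df.sort_values("start_date"),
    incidents,
    left_on="start_date",
    right_on="last_incident_end",
    by="customer_id",
    direction="backward",
)
out["days"] = (out["start_date"] - out["last_incident_end"]).dt.days.mask(out["incident"])
```

merge_asof requires both frames to be sorted on their respective date keys, which the sort_values calls above ensure.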
