I have the following dataframe:
customer_id | start_date | end_date | incident |
---|---|---|---|
1 | 2022-01-01 | 2022-01-03 | False |
1 | 2022-01-02 | 2022-01-04 | True |
1 | 2022-01-04 | 2022-01-06 | False |
1 | 2022-01-05 | 2022-01-08 | False |
1 | 2022-01-05 | 2022-01-06 | False |
I now want to know, for each customer and row, how many days ago the last incident occurred. To be precise: for each row, I want the number of days between the end_date of the last row with incident == True and that row's start_date.
This would be the desired output.
customer_id | start_date | end_date | incident | days_since_last_incident |
---|---|---|---|---|
1 | 2022-01-01 | 2022-01-03 | False | NaN |
1 | 2022-01-02 | 2022-01-04 | True | NaN |
1 | 2022-01-04 | 2022-01-06 | False | 0 |
1 | 2022-01-05 | 2022-01-08 | False | 1 |
1 | 2022-01-05 | 2022-01-06 | False | 1 |
Is there an elegant solution to this?
So far, I have tried a groupby with an apply function that in turn applies another function to each row, but that threw an out-of-bounds error for the rows that have no previous incident. Here is my attempt so far. It only works for rows with previous incidents:
def days_since_last_incident(group):
    group["days_since_last_incidents"] = group.apply(
        lambda row: (
            row["start_date"]
            - (
                group[
                    (group["incident"] == True)
                    & (group["end_date"] <= row["start_date"])
                ]["end_date"].values
            )
        ).days,
        axis=1,
    )

df.groupby("customer_id").apply(days_since_last_incident)
Try (explanation below):
import pandas as pd

# Not needed if start_date and end_date already have a datetime64 dtype
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

def days_from_last_incident(df):
    # Called once per customer_id group; here df is that group's rows
    return (df['end_date'].where(df['incident']).ffill()
                          .rsub(df['start_date'])
                          .mask(df['incident'])
                          .dt.days.to_frame())

df['days'] = df.groupby(['customer_id'], group_keys=False).apply(days_from_last_incident)
Output:
>>> df
   customer_id start_date   end_date  incident  days
0            1 2022-01-01 2022-01-03     False   NaN
1            1 2022-01-02 2022-01-04      True   NaN
2            1 2022-01-04 2022-01-06     False   0.0
3            1 2022-01-05 2022-01-08     False   1.0
4            1 2022-01-05 2022-01-06     False   1.0
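As a side note, the same chain can likely be written without `groupby(...).apply`, by forward-filling within each customer through `groupby(...).ffill()`. This is only a sketch: the sample frame is rebuilt from the question's table, the column name `days_since_last_incident` is taken from the desired output, and, like the function above, it assumes the rows are already in chronological order within each customer.

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05", "2022-01-05"]),
    "end_date": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-08", "2022-01-06"]),
    "incident": [False, True, False, False, False],
})

# Keep end_date only on incident rows, then forward-fill per customer
last_incident_end = (df["end_date"].where(df["incident"])
                                   .groupby(df["customer_id"])
                                   .ffill())

# start_date minus the last incident's end_date, blanked on incident rows
df["days_since_last_incident"] = ((df["start_date"] - last_incident_end)
                                  .mask(df["incident"])
                                  .dt.days)
print(df)
```

Because the result is a plain Series aligned on the original index, no `to_frame()` is needed here.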
Step by step for the first group (customer_id=1):
# Step 1: keep end_date only where incident is True, then forward-fill it
>>> out = df['end_date'].where(df['incident']).ffill()
0 NaT
1 2022-01-04
2 2022-01-04
3 2022-01-04
4 2022-01-04
# Step 2: (right) subtract start_date
>>> out = out.rsub(df['start_date'])
0 NaT
1 -2 days
2 0 days
3 1 days
4 1 days
# Step 3: remove delta for all rows where incident is true
>>> out = out.mask(df['incident'])
0 NaT
1 NaT
2 0 days
3 1 days
4 1 days
dtype: timedelta64[ns]
# Step 4: keep the day part of timedelta
>>> out = out.dt.days.to_frame()
     0
0  NaN
1  NaN
2  0.0
3  1.0
4  1.0
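One caveat: the forward fill goes by row order and does not check the `end_date <= start_date` condition from the question's attempt, so an incident that ends after a later row starts would give a negative value. If that condition matters, `pd.merge_asof` might be a better fit, since it matches each row to the latest incident whose end date is not after the row's start date. A minimal sketch, assuming the sample data from the question; the helper column `last_incident_end` is a name made up for this example.

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05", "2022-01-05"]),
    "end_date": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-08", "2022-01-06"]),
    "incident": [False, True, False, False, False],
})

# merge_asof needs both sides sorted on their merge keys
left = df.reset_index().sort_values("start_date", kind="stable")
right = (df.loc[df["incident"], ["customer_id", "end_date"]]
           .rename(columns={"end_date": "last_incident_end"})  # hypothetical name
           .sort_values("last_incident_end"))

# For each row: latest incident of the same customer with end <= start_date
merged = pd.merge_asof(left, right,
                       left_on="start_date", right_on="last_incident_end",
                       by="customer_id", direction="backward")

# Restore the original row labels so the assignment aligns with df
merged = merged.set_index("index").sort_index()
df["days_since_last_incident"] = (
    merged["start_date"] - merged["last_incident_end"]).dt.days
print(df)
```

Unlike the forward-fill version, an incident row is not blanked automatically here: it gets a value whenever an earlier incident ended on or before its own start date. Add a `.mask(df["incident"])` if such rows should stay NaN.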