Get the number of days since a value last occurred within a group

Question

I have the following dataframe:

customer_id	start_date	end_date	incident
1	2022-01-01	2022-01-03	False
1	2022-01-02	2022-01-04	True
1	2022-01-04	2022-01-06	False
1	2022-01-05	2022-01-08	False
1	2022-01-05	2022-01-06	False
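For anyone who wants to reproduce this, the table above can be built roughly as follows. This is only a sketch: it assumes the date columns should be parsed to datetimes, and the variable name `df` is chosen to match the code further down.

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 1, 1],
        "start_date": ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05", "2022-01-05"],
        "end_date": ["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-08", "2022-01-06"],
        "incident": [False, True, False, False, False],
    }
)

# Parse the date strings so that date arithmetic works later on.
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
```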

I now want to know, for each customer and row, how many days ago the last incident occurred. To be precise: for each row, I want the number of days between the end_date of the most recent earlier row with incident == True and that row's start_date.

This would be the desired output:

customer_id	start_date	end_date	incident	days_since_last_incident
1	2022-01-01	2022-01-03	False	NaN
1	2022-01-02	2022-01-04	True	NaN
1	2022-01-04	2022-01-06	False	0
1	2022-01-05	2022-01-08	False	1
1	2022-01-05	2022-01-06	False	1

Is there an elegant solution to this?

So far, I have tried calling apply on each group and then applying another function to each row, but that throws an out-of-bounds error for rows that have no previous incident. Here is my attempt so far. It only works for rows with previous incidents:

def days_since_last_incident(group):
    group["days_since_last_incidents"] = group.apply(
        lambda row: (
            row["start_date"]
            - (
                group[
                    (group["incident"] == True)
                    & (group["end_date"] <= row["start_date"])
                ]["end_date"].values
            )
        ).days,
        axis=1,
    )


df.groupby("customer_id").apply(days_since_last_incident)

Answer 1

Try (explanation below):

# Not mandatory if start_date and end_date already have a datetime64 dtype
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

def days_from_last_incident(df):
    return (df['end_date'].where(df['incident']).ffill()
                          .rsub(df['start_date'])
                          .mask(df['incident'])
                          .dt.days.to_frame())
    
df['days'] = df.groupby(['customer_id'], group_keys=False).apply(days_from_last_incident)

Output:

>>> df

   customer_id start_date   end_date  incident  days
0            1 2022-01-01 2022-01-03     False   NaN
1            1 2022-01-02 2022-01-04      True   NaN
2            1 2022-01-04 2022-01-06     False   0.0
3            1 2022-01-05 2022-01-08     False   1.0
4            1 2022-01-05 2022-01-06     False   1.0
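Note that the days column comes out as float (0.0, 1.0) because NaN cannot be stored in a plain integer column. If whole numbers are preferred, one option is pandas' nullable integer dtype. A minimal sketch, using a stand-in Series with the same values as the output above (the names `days` and `days_int` are just for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for df['days'] from the output above.
days = pd.Series([np.nan, np.nan, 0.0, 1.0, 1.0], name="days")

# "Int64" (capital I) is the nullable integer dtype; missing values become <NA>.
days_int = days.astype("Int64")
print(days_int)
```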

Step by step for the first group (customer_id=1):

# Step 1: keep end_date only where incident is True, then forward-fill it
>>> out = df['end_date'].where(df['incident']).ffill()
0          NaT
1   2022-01-04
2   2022-01-04
3   2022-01-04
4   2022-01-04

# Step 2: (right) subtract start_date
>>> out = out.rsub(df['start_date'])
0       NaT
1   -2 days
2    0 days
3    1 days
4    1 days

# Step 3: remove delta for all rows where incident is true
>>> out = out.mask(df['incident'])
0      NaT
1      NaT
2   0 days
3   1 days
4   1 days
dtype: timedelta64[ns]

# Step 4: keep the day part of timedelta
>>> out = out.dt.days.to_frame()
     0
0  NaN
1  NaN
2  0.0
3  1.0
4  1.0
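The steps above can probably also be written without groupby.apply, by doing only the forward fill per group and leaving the rest as plain column arithmetic. This is a sketch rather than a drop-in replacement: the second customer (customer_id=2) is made-up data added to show that incident dates do not leak across customers, and `last_incident_end` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        # customer_id=2 rows are hypothetical and not part of the question's data.
        "customer_id": [1, 1, 1, 1, 1, 2, 2],
        "start_date": pd.to_datetime(
            ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05",
             "2022-01-05", "2022-01-05", "2022-01-09"]
        ),
        "end_date": pd.to_datetime(
            ["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-08",
             "2022-01-06", "2022-01-07", "2022-01-10"]
        ),
        "incident": [False, True, False, False, False, False, False],
    }
)

# Keep end_date only on incident rows, then forward-fill within each customer.
last_incident_end = (
    df["end_date"].where(df["incident"]).groupby(df["customer_id"]).ffill()
)

# Subtract, blank out the incident rows themselves, and keep the day count.
df["days"] = (df["start_date"] - last_incident_end).mask(df["incident"]).dt.days
print(df)
```

Customer 2 has no incident at all, so both of its rows stay NaN. Like the groupby.apply version, this relies on the rows already being in chronological order within each customer.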

Solution 1
0 2023-01-21 17:33:37
