Getting the days since last occurrence of a value within a group

I have the following dataframe:

customer_id  start_date  end_date    incident
1            2022-01-01  2022-01-03  False
1            2022-01-02  2022-01-04  True
1            2022-01-04  2022-01-06  False
1            2022-01-05  2022-01-08  False
1            2022-01-05  2022-01-06  False

I now want to know, for each customer and row, how many days ago the last incident occurred. To be precise: for each row, I want the number of days between its start_date and the end_date of the last row with incident == True.

This would be the desired output:

customer_id  start_date  end_date    incident  days_since_last_incident
1            2022-01-01  2022-01-03  False     NaN
1            2022-01-02  2022-01-04  True      NaN
1            2022-01-04  2022-01-06  False     0
1            2022-01-05  2022-01-08  False     1
1            2022-01-05  2022-01-06  False     1

Is there an elegant solution to this?

So far, I have tried using an apply function on each group and then applying another function to each row, but that raised an out-of-bounds error for rows that had no previous incidents. Here is my attempt so far. It only works for rows with previous incidents:

def days_since_last_incident(group):
    group["days_since_last_incidents"] = group.apply(
        lambda row: (
            row["start_date"]
            - (
                group[
                    (group["incident"] == True)
                    & (group["end_date"] <= row["start_date"])
                ]["end_date"].values
            )
        ).days,
        axis=1,
    )


df.groupby("customer_id").apply(days_since_last_incident)

Try (explanation below):

# Not needed if start_date and end_date already have a datetime64 dtype
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

def days_from_last_incident(df):
    return (df['end_date'].where(df['incident']).ffill()
                          .rsub(df['start_date'])
                          .mask(df['incident'])
                          .dt.days.to_frame())
    
df['days'] = df.groupby(['customer_id'], group_keys=False).apply(days_from_last_incident)

Output:

>>> df

   customer_id start_date   end_date  incident  days
0            1 2022-01-01 2022-01-03     False   NaN
1            1 2022-01-02 2022-01-04      True   NaN
2            1 2022-01-04 2022-01-06     False   0.0
3            1 2022-01-05 2022-01-08     False   1.0
4            1 2022-01-05 2022-01-06     False   1.0
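
For reference, here is a minimal self-contained sketch of the same idea that avoids `groupby(...).apply` by forward-filling within each group directly. The sample frame is rebuilt from the question's data, and the helper name `last_incident_end` is only illustrative. It relies on the same assumption as the code above: rows are already ordered so that "last incident" means the nearest incident row above.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05", "2022-01-05"]
    ),
    "end_date": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-08", "2022-01-06"]
    ),
    "incident": [False, True, False, False, False],
})

# Keep end_date only on incident rows, then forward fill per customer.
last_incident_end = (
    df["end_date"]
    .where(df["incident"])
    .groupby(df["customer_id"])
    .ffill()
)

# Subtract, blank out the incident rows themselves, keep whole days.
df["days"] = (df["start_date"] - last_incident_end).mask(df["incident"]).dt.days

print(df)
```

Because the forward fill is done on a grouped Series, the result stays aligned with the original index and can be assigned straight back as a column.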

Step by step for the first group (customer_id=1):

# Step 1: keep end_date only where incident is True, then forward fill it
>>> out = df['end_date'].where(df['incident']).ffill()
0          NaT
1   2022-01-04
2   2022-01-04
3   2022-01-04
4   2022-01-04

# Step 2: (right) subtract start_date
>>> out = out.rsub(df['start_date'])
0       NaT
1   -2 days
2    0 days
3    1 days
4    1 days

# Step 3: remove delta for all rows where incident is true
>>> out = out.mask(df['incident'])
0      NaT
1      NaT
2   0 days
3   1 days
4   1 days
dtype: timedelta64[ns]

# Step 4: keep the day part of timedelta
>>> out = out.dt.days.to_frame()
     0
0  NaN
1  NaN
2  0.0
3  1.0
4  1.0
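
Note that the steps above take the nearest incident row above each row and mask the incident rows themselves; they do not check that the incident's end_date is on or before the row's start_date. If that condition from the question matters for your data, one possible alternative is `pd.merge_asof`, sketched below. This is a different technique from the answer above, shown under the assumption that both date columns are datetime64 and that the original index is unnamed; the column name `last_incident_end` is made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05", "2022-01-05"]
    ),
    "end_date": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-08", "2022-01-06"]
    ),
    "incident": [False, True, False, False, False],
})

# End dates of incident rows only, sorted as merge_asof requires.
incidents = (
    df.loc[df["incident"], ["customer_id", "end_date"]]
    .rename(columns={"end_date": "last_incident_end"})
    .sort_values("last_incident_end")
)

# Keep the original row labels in an "index" column so order can be restored.
left = df.reset_index().sort_values("start_date")

# For each row, pick the latest incident end_date <= start_date per customer.
merged = pd.merge_asof(
    left,
    incidents,
    left_on="start_date",
    right_on="last_incident_end",
    by="customer_id",
    direction="backward",
)

merged["days_since_last_incident"] = (
    merged["start_date"] - merged["last_incident_end"]
).dt.days

# Restore the original row order.
result = merged.set_index("index").sort_index()
print(result)
```

Rows with no qualifying earlier incident get NaN without any explicit mask, since no match is found for them.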
