Getting the days since last occurrence of a value within a group
I have the following dataframe:
customer_id | start_date | end_date | incident |
---|---|---|---|
1 | 2022-01-01 | 2022-01-03 | False |
1 | 2022-01-02 | 2022-01-04 | True |
1 | 2022-01-04 | 2022-01-06 | False |
1 | 2022-01-05 | 2022-01-08 | False |
1 | 2022-01-05 | 2022-01-06 | False |
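For anyone who wants to reproduce this, the table can be built roughly as follows. This is only a sketch: the column names are taken from the code further down in this post, and parsing the dates with `pd.to_datetime` is an assumption about the intended dtypes.

```python
import pandas as pd

# Sample data copied from the table above; the dates are parsed to
# datetime64 so that they can be subtracted from each other later on.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05", "2022-01-05"]
    ),
    "end_date": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-08", "2022-01-06"]
    ),
    "incident": [False, True, False, False, False],
})
```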
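I would like to know, for each customer and row, how many days ago the last incident occurred. To be precise: for each row, I want the number of days between that row's start_date and the end_date of the most recent row with incident == True.
This would be the desired output.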
customer_id | start_date | end_date | incident | days_since_last_incident |
---|---|---|---|---|
1 | 2022-01-01 | 2022-01-03 | False | NaN |
1 | 2022-01-02 | 2022-01-04 | True | NaN |
1 | 2022-01-04 | 2022-01-06 | False | 0 |
1 | 2022-01-05 | 2022-01-08 | False | 1 |
1 | 2022-01-05 | 2022-01-06 | False | 1 |
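Is there an elegant solution?
So far I have tried using an apply function and then applying another function to each row, but this raises an out-of-bounds error for rows that have no previous incident. This is my attempt so far. It only works for rows that do have a previous incident: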
def days_since_last_incident(group):
    group["days_since_last_incidents"] = group.apply(
        lambda row: (
            row["start_date"]
            - (
                group[
                    (group["incident"] == True)
                    & (group["end_date"] <= row["start_date"])
                ]["end_date"].values
            )
        ).days,
        axis=1,
    )

df.groupby("customer_id").apply(days_since_last_incident)
Try this (explanation below):
import pandas as pd

# Not needed if start_date and end_date already have a datetime64 dtype
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

def days_from_last_incident(df):
    # Here df is one customer's group, not the whole frame
    return (df['end_date'].where(df['incident']).ffill()
                          .rsub(df['start_date'])
                          .mask(df['incident'])
                          .dt.days.to_frame())

df['days'] = df.groupby(['customer_id'], group_keys=False).apply(days_from_last_incident)
Output:
>>> df
   customer_id start_date   end_date  incident  days
0            1 2022-01-01 2022-01-03     False   NaN
1            1 2022-01-02 2022-01-04      True   NaN
2            1 2022-01-04 2022-01-06     False   0.0
3            1 2022-01-05 2022-01-08     False   1.0
4            1 2022-01-05 2022-01-06     False   1.0
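The same chain can probably also be written without `apply`, by forward filling per customer with `groupby(...).ffill()`. The following is a minimal sketch under that assumption, not a drop-in replacement. The row for customer 2 is hypothetical and is only there to show that the fill does not leak from one customer to the next.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1, 2],  # customer 2 is a made-up extra row
    "start_date": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-04",
         "2022-01-05", "2022-01-05", "2022-01-10"]
    ),
    "end_date": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-06",
         "2022-01-08", "2022-01-06", "2022-01-12"]
    ),
    "incident": [False, True, False, False, False, False],
})

# end_date of the most recent incident row, forward filled within each customer
last_end = (
    df["end_date"]
    .where(df["incident"])
    .groupby(df["customer_id"])
    .ffill()
)

# Days from that end_date to this row's start_date; incident rows are blanked
df["days"] = (df["start_date"] - last_end).mask(df["incident"]).dt.days
```

The result is aligned on the original index, so it can be assigned straight back to the frame without `to_frame()` or `group_keys=False`.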
Steps for the first group (customer_id=1):
# Step 1: keep end_date only where incident is True, then forward fill it
>>> out = df['end_date'].where(df['incident']).ffill()
0          NaT
1   2022-01-04
2   2022-01-04
3   2022-01-04
4   2022-01-04

# Step 2: (right) subtract, i.e. start_date - out
>>> out = out.rsub(df['start_date'])
0       NaT
1   -2 days
2    0 days
3    1 days
4    1 days

# Step 3: remove the delta for all rows where incident is True
>>> out = out.mask(df['incident'])
0      NaT
1      NaT
2   0 days
3   1 days
4   1 days
dtype: timedelta64[ns]

# Step 4: keep the day part of the timedelta
>>> out = out.dt.days.to_frame()
     0
0  NaN
1  NaN
2  0.0
3  1.0
4  1.0
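One caveat: the forward fill in Step 1 depends on row order within each customer and never checks the `end_date <= start_date` condition from the question's own attempt. If that condition matters (for example when rows are not sorted, or an incident ends after a later row starts), `pd.merge_asof` might be a better fit. The sketch below is a different technique from the function above, and the names `incidents`, `last_incident_end`, `left`, `merged` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1],
    "start_date": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05", "2022-01-05"]
    ),
    "end_date": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-08", "2022-01-06"]
    ),
    "incident": [False, True, False, False, False],
})

# Right side: one row per incident, keyed by the date the incident ended
incidents = (
    df.loc[df["incident"], ["customer_id", "end_date"]]
    .rename(columns={"end_date": "last_incident_end"})
    .sort_values("last_incident_end")
)

# merge_asof requires both sides to be sorted by their key column;
# reset_index keeps the original row labels in a column named "index"
left = df.reset_index().sort_values("start_date", kind="stable")

merged = pd.merge_asof(
    left,
    incidents,
    left_on="start_date",
    right_on="last_incident_end",
    by="customer_id",        # only match incidents of the same customer
    direction="backward",    # latest incident end that is <= start_date
)

merged["days"] = (merged["start_date"] - merged["last_incident_end"]).dt.days

# Restore the original row order
result = merged.set_index("index").sort_index()
```

For the sample data this gives the same `days` column as above. Unlike the function above, incident rows are not masked here, so an incident row that starts after an earlier incident ended would get a value instead of NaN.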