Making a custom filter on a grouped dataframe with multiple conditions

Question

Please find below my input and the desired output:

Input

      Id   Status        Date
0  Id001      off  01/01/2022
1  Id001   online  01/06/2022
2  Id002      off  05/12/2021
3  Id002      off  30/08/2021
4  Id003  running  02/03/2021
5  Id004   online  21/10/2021
6  Id004  running  01/04/2022
7  Id005   online  01/02/2022
8  Id005   online  02/07/2022

Output (desired)

      Id   Status        Date
0  Id001   online  01/06/2022
1  Id002      off  05/12/2021
2  Id003  running  02/03/2021
3  Id004   online  21/10/2021
4  Id005   online  01/02/2022

The goal is, first of all, to have one row per Id in the output.
The output is built according to the following simple rules:

if a group of Id has «online» as a Status then:
    if «online» occurrences > 1 then:
        choose the oldest row that has «online»
    else:
        choose the row that has «online»
else:
    choose the most recent row

This is what I have tried so far:


import pandas as pd
import numpy as np

df = pd.DataFrame({'Id': ['Id001','Id001','Id002','Id002','Id003','Id004','Id004','Id005','Id005'],
                    'Status': ['off','online','off','off','running','online','running','online','online'],
                    'Date': ['01/01/2022','01/06/2022','05/12/2021','30/08/2021','02/03/2021','21/10/2021','01/04/2022','01/02/2022','02/07/2022']})

df.groupby(['Id', 'Status'], as_index=False).agg({'Date' : [np.min, np.max]}).reset_index()

Do you have any suggestions?
Any help is much appreciated!

Answer 1

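Answer fully rewritten. First keep only the online rows, sort them by Date, and drop duplicates on Id so that the oldest online row per Id remains. Date must already be parsed with pd.to_datetime(df['Date'], dayfirst=True), as the printed output shows; sorting the raw day/month/year strings would not be chronological: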

df1 = df[df['Status'].eq('online')].sort_values('Date').drop_duplicates('Id')
print (df1)
      Id  Status       Date
5  Id004  online 2021-10-21
7  Id005  online 2022-02-01
1  Id001  online 2022-06-01
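
The printed dates above are datetimes, so the Date column was parsed before sorting. Here is a minimal sketch of why that matters, using three of the online dates from the question. The variable names are illustrative only.

```python
import pandas as pd

# Three of the "online" dates from the question, as day/month/year strings.
raw = pd.Series(['01/06/2022', '21/10/2021', '01/02/2022'])

# Sorting the strings compares characters, so '01/...' sorts before '21/...'.
as_text = raw.sort_values().tolist()

# Parsing with dayfirst=True reads '01/06/2022' as 1 June 2022, not 6 January.
parsed = pd.to_datetime(raw, dayfirst=True)
as_dates = parsed.sort_values().dt.strftime('%d/%m/%Y').tolist()

print(as_text)   # ['01/02/2022', '01/06/2022', '21/10/2021']
print(as_dates)  # ['21/10/2021', '01/02/2022', '01/06/2022']
```

With string sorting, the 2021 date ends up last, so "oldest first" would pick the wrong row.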

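Then keep the Ids that have no online row and sort by Date in descending order, so that the most recent row per Id remains: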

df2 = df[~df['Id'].isin(df1['Id'])].sort_values('Date', ascending=False).drop_duplicates('Id')
print(df2)
      Id   Status       Date
2  Id002      off 2021-12-05
4  Id003  running 2021-03-02

Finally, concatenate the two dataframes:

df = pd.concat([df1, df2]).sort_values('Id', ignore_index=True)
print (df)
      Id   Status       Date
0  Id001   online 2022-06-01
1  Id002      off 2021-12-05
2  Id003  running 2021-03-02
3  Id004   online 2021-10-21
4  Id005   online 2022-02-01
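
The three steps above can be collected into one self-contained sketch. The helper name pick_rows is hypothetical and not part of the original answer. It assumes the Date strings are in day/month/year order, as in the question.

```python
import pandas as pd

def pick_rows(df):  # hypothetical helper name, for illustration only
    """One row per Id: the oldest 'online' row if any, else the most recent row."""
    # Parse day/month/year strings so sorting is chronological.
    df = df.assign(Date=pd.to_datetime(df['Date'], dayfirst=True))
    # Oldest online row per Id.
    online = (df[df['Status'].eq('online')]
              .sort_values('Date')
              .drop_duplicates('Id'))
    # Most recent row for every Id without an online row.
    rest = (df[~df['Id'].isin(online['Id'])]
            .sort_values('Date', ascending=False)
            .drop_duplicates('Id'))
    return pd.concat([online, rest]).sort_values('Id', ignore_index=True)

df = pd.DataFrame({
    'Id': ['Id001', 'Id001', 'Id002', 'Id002', 'Id003',
           'Id004', 'Id004', 'Id005', 'Id005'],
    'Status': ['off', 'online', 'off', 'off', 'running',
               'online', 'running', 'online', 'online'],
    'Date': ['01/01/2022', '01/06/2022', '05/12/2021', '30/08/2021', '02/03/2021',
             '21/10/2021', '01/04/2022', '01/02/2022', '02/07/2022']})

result = pick_rows(df)
print(result)
```

Because the helper uses assign, the caller's dataframe is left unchanged.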

The original solution should be changed to:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df1 = (df.assign(s = df['Status'].eq('online')).groupby(['Id','s'])
         .agg(Date_min=('Date','idxmin'), Date_max=('Date','idxmax')))

df1 = df1[~df1.index.get_level_values(0).duplicated(keep='last')].reset_index()
print (df1)
      Id      s  Date_min  Date_max
0  Id001   True         1         1
1  Id002  False         3         2
2  Id003  False         4         4
3  Id004   True         5         5
4  Id005   True         7         8

df = df.loc[np.where(df1['s'], df1['Date_min'], df1['Date_max'])]
print (df)
      Id   Status       Date
1  Id001   online 2022-06-01
2  Id002      off 2021-12-05
4  Id003  running 2021-03-02
5  Id004   online 2021-10-21
7  Id005   online 2022-02-01
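
As a cross-check, the rules from the question can also be translated literally into a plain loop over the groups. This is not the approach used above, only a readable sketch to compare results against. It is likely slower on large data, and the names labels and picked are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Id': ['Id001', 'Id001', 'Id002', 'Id002', 'Id003',
           'Id004', 'Id004', 'Id005', 'Id005'],
    'Status': ['off', 'online', 'off', 'off', 'running',
               'online', 'running', 'online', 'online'],
    'Date': ['01/01/2022', '01/06/2022', '05/12/2021', '30/08/2021', '02/03/2021',
             '21/10/2021', '01/04/2022', '01/02/2022', '02/07/2022']})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

labels = []
for _, group in df.groupby('Id'):
    online = group[group['Status'].eq('online')]
    if len(online):
        # At least one online row: take the oldest one.
        labels.append(online['Date'].idxmin())
    else:
        # No online row: take the most recent row of the group.
        labels.append(group['Date'].idxmax())

picked = df.loc[labels]
print(picked)
```

The selected row labels are 1, 2, 4, 5 and 7, the same rows as in the output above.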

Solution 1
4, accepted, 2022-09-09 09:41:08
