繁体   English   中英

在具有多个条件的分组 dataframe 上制作自定义过滤器

[英]Make a customized filter on a grouped dataframe with multiple conditions

请在下面找到我的输入/所需的 output:

输入

      Id   Status        Date
0  Id001      off  01/01/2022
1  Id001   online  01/06/2022
2  Id002      off  05/12/2021
3  Id002      off  30/08/2021
4  Id003  running  02/03/2021
5  Id004   online  21/10/2021
6  Id004  running  01/04/2022
7  Id005   online  01/02/2022
8  Id005   online  02/07/2022

OUTPUT(需要)

      Id   Status        Date
0  Id001   online  01/06/2022
1  Id002      off  05/12/2021
2  Id003  running  02/03/2021
3  Id004   online  21/10/2021
4  Id005   online  01/02/2022

目标首先是在 output 中每个 Id 有一行。
output 将基于以下简单语句制成:

if a group of Id has «online» as a Status then:  
    if «online» occurences > 1 then: 
        Choose the oldest Id that has «online»
    else:  
        choose the Id that has «online»
else:
    Choose the most recent Id

这是我迄今为止尝试过的:


import pandas as pd
import numpy as np

df = pd.DataFrame({'Id': ['Id001','Id001','Id002','Id002','Id003','Id004','Id004','Id005','Id005'],
                    'Status': ['off','online','off','off','running','online','running','online','online'],
                    'Date': ['01/01/2022','01/06/2022','05/12/2021','30/08/2021','02/03/2021','21/10/2021','01/04/2022','01/02/2022','02/07/2022']})

df.groupby(['Id', 'Status'], as_index=False).agg({'Date' : [np.min, np.max]}).reset_index()

请问您有什么建议/建议吗?
我们非常感谢任何帮助!

答案已完全编辑 - 首先仅过滤online行,按Date排序并按第一个Id删除重复项:

df1 = df[df['Status'].eq('online')].sort_values('Date').drop_duplicates('Id')
print (df1)
      Id  Status       Date
5  Id004  online 2021-10-21
7  Id005  online 2022-02-01
1  Id001  online 2022-06-01

然后过滤不匹配的Id并降序排序:

df2 =df[~df['Id'].isin(df1['Id'])].sort_values('Date',ascending=False).drop_duplicates('Id')
print (df2)
      Id   Status       Date
2  Id002      off 2021-12-05
4  Id003  running 2021-03-02

最后加入两个数据框:

df = pd.concat([df1, df2]).sort_values('Id', ignore_index=True)
print (df)
      Id   Status       Date
0  Id001   online 2022-06-01
1  Id002      off 2021-12-05
2  Id003  running 2021-03-02
3  Id004   online 2021-10-21
4  Id005   online 2022-02-01

应更改原始解决方案:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df1 = (df.assign(s = df['Status'].eq('online')).groupby(['Id','s'])
         .agg(Date_min=('Date','idxmin'), Date_max=('Date','idxmax')))

df1 = df1[~df1.index.get_level_values(0).duplicated(keep='last')].reset_index()
print (df1)
      Id      s  Date_min  Date_max
0  Id001   True         1         1
1  Id002  False         3         2
2  Id003  False         4         4
3  Id004   True         5         5
4  Id005   True         7         8

df = df.loc[np.where(df1['s'], df1['Date_min'], df1['Date_max'])]
print (df)
      Id   Status       Date
1  Id001   online 2022-06-01
2  Id002      off 2021-12-05
4  Id003  running 2021-03-02
5  Id004   online 2021-10-21
7  Id005   online 2022-02-01

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM