
Filter pandas dataframe based on opposite condition whether True/False in a column

I want to drop duplicate rows from the pandas DataFrame below, deduplicating on the column "msgid" and keeping the row that satisfies the following conditions:

For each "msgid", start by evaluating "tr_flag":

  1. If there is a mix of True and False, keep the True row.
  2. If all are False, keep the row with min(evid).
  3. If more than one is True, keep the True row with max(evid).

I tried an SQL approach, using a CASE statement with PARTITION BY msgid, but I could only get the first and second scenarios to work, not all three. Is SQL fine for this, or is there a better approach?

dataset:

         Date plid  evid msgid tr_type  tr_flag
0  08-11-2021  pl1   111  msg1     new    False
1  08-11-2021  pl1   222  msg1     new    False
2  08-11-2021  pl1   333  msg1     new    False
3  08-11-2021  pl1   444  msg2     new    False
4  08-11-2021  pl1   555  msg2     new     True
5  08-11-2021  pl1   666  msg2     new    False
6  08-11-2021  pl1   777  msg3     new     True
7  08-11-2021  pl1   888  msg3     new     True
8  08-11-2021  pl1   999  msg3     new     True
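
For anyone who wants to reproduce this, the table above can be rebuilt as a DataFrame roughly as follows. This is a minimal sketch: the variable name df is chosen to match the answer's snippet, and keeping Date as plain strings is an assumption, since the question does not state the column dtypes.

```python
import pandas as pd

# Rebuild the sample data from the question.
# Assumption: Date is stored as a string, evid as an integer.
df = pd.DataFrame({
    "Date": ["08-11-2021"] * 9,
    "plid": ["pl1"] * 9,
    "evid": [111, 222, 333, 444, 555, 666, 777, 888, 999],
    "msgid": ["msg1"] * 3 + ["msg2"] * 3 + ["msg3"] * 3,
    "tr_type": ["new"] * 9,
    "tr_flag": [False, False, False, False, True, False, True, True, True],
})

print(df)
```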

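You can assign a custom sorting key (negative 'evid' where 'tr_flag' is True, positive 'evid' where it is False), sort on that key, group by 'msgid', and keep the first row of each group. In ascending order, True rows come first with the largest 'evid' leading, followed by False rows with the smallest 'evid' leading: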

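(df.assign(key=df['tr_flag'].eq(False).mul(2).sub(1).mul(df['evid']))  # +evid if False, -evid if True
   .sort_values(by='key')
   .groupby('msgid').first()   # first non-null value per column in each group
   .drop('key', axis=1)        # remove the helper column
)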

output:

             Date plid  evid tr_type  tr_flag
msgid                                        
msg1   08-11-2021  pl1   111     new    False
msg2   08-11-2021  pl1   555     new     True
msg3   08-11-2021  pl1   999     new     True
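
To see why the sort produces this result, it can help to look at the key itself. The sketch below rebuilds the sample data so it runs on its own, prints the signed key, and then uses sort_values plus drop_duplicates instead of groupby().first(). That swap is a variation, not the method shown above: it keeps 'msgid' as a regular column and preserves the original row labels. It also returns whole rows, whereas first() skips nulls column by column, which only matters if the data contains missing values. The names key and result are illustrative.

```python
import pandas as pd

# Sample data from the question (Date kept as a string: an assumption).
df = pd.DataFrame({
    "Date": ["08-11-2021"] * 9,
    "plid": ["pl1"] * 9,
    "evid": [111, 222, 333, 444, 555, 666, 777, 888, 999],
    "msgid": ["msg1"] * 3 + ["msg2"] * 3 + ["msg3"] * 3,
    "tr_type": ["new"] * 9,
    "tr_flag": [False, False, False, False, True, False, True, True, True],
})

# Signed key: keep evid where tr_flag is False, use -evid where it is True.
key = df["evid"].where(~df["tr_flag"], -df["evid"])
print(key.tolist())
# [111, 222, 333, 444, -555, 666, -777, -888, -999]

result = (
    df.assign(key=key)
      .sort_values("key")          # True rows first (largest evid leading)
      .drop_duplicates("msgid")    # keeps the first row seen per msgid
      .drop(columns="key")
      .sort_index()                # restore the original row order
)
print(result)
```

With the sample data this keeps the rows with evid 111, 555 and 999, matching the output above.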


 