I want to delete the duplicate rows from the pandas DataFrame below, based on the column "msgid", and keep the rows satisfying the conditions below:
Start by evaluating "tr_flag":
I tried an SQL approach using a CASE statement and PARTITION BY msgid, but I could not get all three scenarios to work, only the first and second. Is SQL OK, or is there a better approach?
dataset:
         Date plid  evid msgid tr_type  tr_flag
0  08-11-2021  pl1   111  msg1     new    False
1  08-11-2021  pl1   222  msg1     new    False
2  08-11-2021  pl1   333  msg1     new    False
3  08-11-2021  pl1   444  msg2     new    False
4  08-11-2021  pl1   555  msg2     new     True
5  08-11-2021  pl1   666  msg2     new    False
6  08-11-2021  pl1   777  msg3     new     True
7  08-11-2021  pl1   888  msg3     new     True
8  08-11-2021  pl1   999  msg3     new     True
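You can assign a custom sorting key (negative 'evid' where 'tr_flag' is True, positive 'evid' where it is False), sort on the key, group by 'msgid', and keep the first row of each group. With an ascending sort, a True row with the largest 'evid' comes first when the group has any True rows; otherwise the False row with the smallest 'evid' comes first: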
(df.assign(key=df['tr_flag'].eq(False).mul(2).sub(1).mul(df['evid']))
   .sort_values(by='key')
   .groupby('msgid').first()
   .drop('key', axis=1)
)
output:
             Date plid  evid tr_type  tr_flag
msgid
msg1   08-11-2021  pl1   111     new    False
msg2   08-11-2021  pl1   555     new     True
msg3   08-11-2021  pl1   999     new     True
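For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and uses the same signed-key idea, but swaps `groupby('msgid').first()` for `drop_duplicates`. That swap is my own variation, not part of the answer above: `groupby().first()` returns the first non-null value per column, so with missing data it could combine values from different rows, whereas `drop_duplicates` always keeps whole rows and leaves 'msgid' as a regular column. The names `sign`, `key`, and `result` are just illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": ["08-11-2021"] * 9,
    "plid": ["pl1"] * 9,
    "evid": [111, 222, 333, 444, 555, 666, 777, 888, 999],
    "msgid": ["msg1"] * 3 + ["msg2"] * 3 + ["msg3"] * 3,
    "tr_type": ["new"] * 9,
    "tr_flag": [False, False, False, False, True, False, True, True, True],
})

# Signed key: -evid where tr_flag is True, +evid where it is False.
sign = 1 - 2 * df["tr_flag"].astype(int)
key = df["evid"] * sign

result = (
    df.assign(key=key)
      .sort_values("key")                      # True rows (negative keys) sort first
      .drop_duplicates("msgid", keep="first")  # keep one whole row per msgid
      .drop(columns="key")
      .sort_values("msgid")
      .reset_index(drop=True)
)
print(result)
```

On the sample data this should keep evid 111 for msg1, 555 for msg2, and 999 for msg3, matching the output above.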