Python: Remove duplicates from DataFrame based on another column value
Remove duplicates based on a value in column of a dataframe
Scenario
key associated_keys value associated_value
KP6070 KP706010/KP706020/KP706030/KP706040/KP706050/KP706060/ AFE.706070.KP AFE.706010.RT
KP6650 KP706610/KP706620//KP706630/KP706640/KP706650 AFE.706650.KP AFE.706010.RT
I tried this Python script:
Deduptest.groupby(['associated_keys']).max()['associated_value'].reset_index()
Deduptest.drop_duplicates(['associated_value'],keep= 'first')
Expected output
key associated_keys value associated_value
KP6070 KP706010/KP706020/KP706030/KP706040/KP706050/KP706060/ AFE.706070.KP AFE.706010.RT
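I am trying to remove duplicates based on the associated_value and associated_keys columns. If a value in associated_keys already exists in any other row of that column, and the associated_value data is the same for both rows, then I want to keep the row whose associated_keys has the greatest length, i.e. the most data.
I tried drop_duplicates and also tried using a length function, but I keep getting both rows in my output.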
Try:
# set up the key to get proper order
df["sort_key"]=df["associated_keys"].str.len()
# sort by that key
df.sort_values("sort_key", inplace=True, ascending=False)
# drop, keeping only the first record (in sorted dataframe, so the one with highest Len)
df.drop_duplicates(subset="associated_value", keep="first", inplace=True)
# drop sort column
df.drop("sort_key", axis=1, inplace=True)
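For reference, here is a self-contained sketch of the same idea that can be run against the two sample rows from the question. The helper name keep_longest is hypothetical, and the frame is rebuilt by hand from the sample above, so treat the column contents as an assumption about how the real data is laid out. Unlike the snippet above, it avoids inplace=True and returns a new DataFrame, which makes it harder to lose the result by not assigning it.

```python
import pandas as pd

# Sample rows copied from the question (rebuilt by hand; assumed layout).
df = pd.DataFrame({
    "key": ["KP6070", "KP6650"],
    "associated_keys": [
        "KP706010/KP706020/KP706030/KP706040/KP706050/KP706060/",
        "KP706610/KP706620//KP706630/KP706640/KP706650",
    ],
    "value": ["AFE.706070.KP", "AFE.706650.KP"],
    "associated_value": ["AFE.706010.RT", "AFE.706010.RT"],
})


def keep_longest(frame):
    """Hypothetical helper: per associated_value, keep the row whose
    associated_keys string is longest. Returns a new DataFrame."""
    # Length of each associated_keys string, indexed like the frame.
    lengths = frame["associated_keys"].str.len()
    # Row labels ordered from longest to shortest (stable for ties).
    order = lengths.sort_values(ascending=False, kind="stable").index
    # After reordering, the first row seen for each associated_value
    # is the longest one, so keep="first" retains it.
    return frame.loc[order].drop_duplicates(subset="associated_value", keep="first")


result = keep_longest(df)
print(result)
```

Note that this, like the answer above, keeps one row per associated_value based on string length only. It does not check whether the individual keys inside associated_keys actually overlap between rows.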