Remove duplicates in pandas, with a priority rule for which row to keep
Here is a df:
COL1 COL2 COL3
seqA NA 10
seqA Unknown 5
seqA Cow 50
seqB NA 2
seqC NA 2
seqC Unknown 2
seqC Bird 6
seqC Cow 1
seqD Unknown 30
seqD Shark 2
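For anyone who wants to reproduce this, the table above could be loaded roughly as follows. This is only a sketch that assumes the data is whitespace-separated text; the variable name `raw` is made up for illustration. Note that pandas treats the literal string NA as a missing value by default, so COL2 ends up holding NaN for those rows.

```python
import io

import pandas as pd

# The sample table from the question, as whitespace-separated text.
raw = """COL1 COL2 COL3
seqA NA 10
seqA Unknown 5
seqA Cow 50
seqB NA 2
seqC NA 2
seqC Unknown 2
seqC Bird 6
seqC Cow 1
seqD Unknown 30
seqD Shark 2
"""

# "NA" is one of pandas' default missing-value markers, so it is parsed as NaN.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df)
```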
The idea is to remove duplicated COL1 values and keep only the row with the lowest COL3, BUT only keep a row whose COL2 is NA or Unknown if the group has no other row with COL3 < 10.
For instance, for seqA I keep:
seqA Unknown 5
because this one is > 10:
seqA Cow 50
but in seqC I keep:
seqC Cow 1
because it is < 10.
In the example, the expected output would be:
COL1 COL2 COL3
seqA Unknown 5
seqB NA 2
seqC Cow 1
seqD Shark 2
So one idea would be to first do:
tab = df.sort_values(by=['COL3'], ascending=True)
But I do not know how to integrate the priority: any row whose COL2 is different from Unknown or NA takes priority, unless its COL3 is > 10.
Let us filter first, then use sort_values + drop_duplicates:
out = df[df.COL3.lt(10) | df.COL2.eq('Unknown')].sort_values('COL3').drop_duplicates('COL1').sort_index()
Out[47]:
COL1 COL2 COL3
1 seqA Unknown 5
3 seqB NaN 2
7 seqC Cow 1
9 seqD Shark 2
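If the priority rule needs to be enforced more literally, one possible variant is to sort on an explicit priority flag before dropping duplicates. The sketch below is only one reading of the question: it assumes a row is "preferred" when COL2 is neither NA nor Unknown and COL3 is below 10, and that a group with no preferred row falls back to its lowest COL3. The names `THRESHOLD`, `informative`, `preferred`, `_rank` and `strict` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NA is represented as NaN.
df = pd.DataFrame({
    "COL1": ["seqA", "seqA", "seqA", "seqB", "seqC",
             "seqC", "seqC", "seqC", "seqD", "seqD"],
    "COL2": [np.nan, "Unknown", "Cow", np.nan, np.nan,
             "Unknown", "Bird", "Cow", "Unknown", "Shark"],
    "COL3": [10, 5, 50, 2, 2, 2, 6, 1, 30, 2],
})

THRESHOLD = 10  # assumed cutoff, taken from "COL3 value < 10" in the question

# A row is informative when COL2 is neither missing nor 'Unknown'.
informative = df["COL2"].notna() & df["COL2"].ne("Unknown")
# Preferred rows are informative rows whose COL3 is below the cutoff.
preferred = informative & df["COL3"].lt(THRESHOLD)

strict = (
    df.assign(_rank=(~preferred).astype(int))  # 0 = preferred, 1 = fallback
      .sort_values(["_rank", "COL3"])          # preferred rows first, then lowest COL3
      .drop_duplicates("COL1")                 # keep the best-ranked row per COL1
      .drop(columns="_rank")
      .sort_index()                            # restore the original row order
)
print(strict)
```

On the sample data this keeps the same four rows as the one-liner above. The two approaches can differ when a group contains an NA or Unknown row with a lower COL3 than an informative row that is itself below 10: the one-liner keeps the NA/Unknown row, while this variant keeps the informative one.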