Remove duplicates, but with a priority for which row to keep, in pandas

Here is a df:

COL1 COL2 COL3 
seqA NA 10
seqA Unknown 5
seqA Cow 50
seqB NA 2
seqC NA 2
seqC Unknown 2
seqC Bird 6
seqC Cow 1
seqD Unknown 30
seqD Shark 2
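
For anyone who wants to reproduce this, here is a minimal sketch that loads the table above into pandas. It assumes the data is whitespace-separated text and that the NA entries are meant as missing values; `read_csv` treats the string "NA" as NaN by default, so COL2 ends up with real NaN values. The variable name `raw` is only for illustration.

```python
import io

import pandas as pd

# The sample table from the question, pasted as whitespace-separated text.
raw = """COL1 COL2 COL3
seqA NA 10
seqA Unknown 5
seqA Cow 50
seqB NA 2
seqC NA 2
seqC Unknown 2
seqC Bird 6
seqC Cow 1
seqD Unknown 30
seqD Shark 2
"""

# "NA" is one of pandas' default missing-value markers, so it is read as NaN.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df)
```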

So the idea is to remove rows with a duplicated COL1 value and keep only the one with the lowest COL3, BUT only take a row whose COL2 is NA or Unknown if there is no other row with a COL3 value < 10.

For instance, for seqA I keep:

seqA Unknown 5

because this one is > 10:

seqA Cow 50

but in seqC I keep:

seqC Cow 1

because it is < 10.

In the example, the expected output would be:

COL1 COL2 COL3 
seqA Unknown 5
seqB NA 2
seqC Cow 1
seqD Shark 2

So one idea would be to first sort by COL3:

tab = df.sort_values(by=['COL3'], ascending=True)

But I do not know how to integrate the priority, namely that anything other than Unknown or NA takes priority unless its COL3 is > 10.

Let us filter first, then use sort_values + drop_duplicates:

out = df[df.COL3.lt(10) | df.COL2.eq('Unknown')].sort_values('COL3').drop_duplicates('COL1').sort_index()
Out[47]: 
   COL1     COL2  COL3
1  seqA  Unknown     5
3  seqB      NaN     2
7  seqC      Cow     1
9  seqD    Shark     2
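
The filter above gives the expected result on the sample data. If I read the rule correctly, though, a NA or Unknown row with a smaller COL3 than a named row would still win after the sort, since both pass the filter. Here is a sketch that encodes the priority explicitly. It rests on my own assumptions: a row is "preferred" when COL2 is a real label (not NaN, not "Unknown") and COL3 < 10, and every other row is only a fallback, ranked by COL3. The helper column `_rank` is a made-up name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "COL1": ["seqA", "seqA", "seqA", "seqB", "seqC",
             "seqC", "seqC", "seqC", "seqD", "seqD"],
    "COL2": [np.nan, "Unknown", "Cow", np.nan, np.nan,
             "Unknown", "Bird", "Cow", "Unknown", "Shark"],
    "COL3": [10, 5, 50, 2, 2, 2, 6, 1, 30, 2],
})

# Assumption: "named" means COL2 is neither missing nor the string "Unknown".
named = df["COL2"].notna() & df["COL2"].ne("Unknown")
# Assumption: a named row only gets priority when its COL3 is below 10.
preferred = named & df["COL3"].lt(10)

out = (
    df.assign(_rank=(~preferred).astype(int))  # 0 = preferred, 1 = fallback
      .sort_values(["_rank", "COL3"])          # preferred first, then lowest COL3
      .drop_duplicates("COL1")                 # keep the first row per COL1
      .drop(columns="_rank")
      .sort_index()                            # restore the original row order
)
print(out)
```

On the sample frame this keeps the same four rows (index 1, 3, 7, 9) as the one-liner. The two only differ when a NA or Unknown row has a lower COL3 than a named row that is below 10.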
