
Pandas: remove duplicates with a criterion

Say I have the following dataframe:

>>> import pandas as pd
>>> 
>>> d=pd.DataFrame()
>>> 
>>> d['Var1']=['A','A','B','B','C','C','D','E','F']
>>> d['Var2']=['A','Z','B','Y','X','C','Q','N','P']
>>> d['Value']=[34, 45, 23, 54, 65, 77,100,102,44]
>>> d
  Var1 Var2  Value
0    A    A     34
1    A    Z     45
2    B    B     23
3    B    Y     54
4    C    X     65
5    C    C     77
6    D    Q    100
7    E    N    102
8    F    P     44
>>> 

I want to drop rows that have duplicates in "Var1", but I want to make sure that the duplicate that is kept is the one where 'Var1' == 'Var2'.

My output dataframe would be:

     Var2  Value
Var1            
A       A     34
B       B     23
C       C     77
D       Q    100
E       N    102
F       P     44
>>> 

Any suggestions as to how I can do this? Would using groupby filter be the best approach?

Here's a one-liner:

>>> d.loc[~d.Var1[(d.Var1 == d.Var2).argsort()].duplicated('last')]

  Var1 Var2  Value
0    A    A     34
2    B    B     23
5    C    C     77
6    D    Q    100
7    E    N    102
8    F    P     44

You can then set the index to Var1 if you want, by calling .set_index('Var1') on the result, to get exactly the output you posted.
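As a small illustration of that last step, here is a sketch that rebuilds the sample frame, applies the one-liner, and sets the index on the result. The names `result` and `indexed` are introduced here for illustration only, and `keep='last'` is spelled out as a keyword for readability.

```python
import pandas as pd

# Sample data from the question.
d = pd.DataFrame({
    'Var1': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'E', 'F'],
    'Var2': ['A', 'Z', 'B', 'Y', 'X', 'C', 'Q', 'N', 'P'],
    'Value': [34, 45, 23, 54, 65, 77, 100, 102, 44],
})

# The one-liner: keep one row per Var1, preferring Var1 == Var2.
result = d.loc[~d.Var1[(d.Var1 == d.Var2).argsort()].duplicated(keep='last')]

# set_index is applied to the filtered result, not to the original frame.
indexed = result.set_index('Var1')
print(indexed)
```

Note that `set_index` returns a new frame, so `result` itself keeps its integer index.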

To break it down: 分解:

  • d.Var1[(d.Var1 == d.Var2).argsort()] is a Series holding the values of Var1, reordered so that the rows where Var1 == Var2 come last

  • ~d.Var1[(d.Var1 == d.Var2).argsort()].duplicated('last') is True for the rows to keep: rows whose Var1 value is not duplicated and, where there are duplicates, the last occurrence in the reordered Series (so the row with Var1 == Var2 has priority)
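The two steps above can be sketched with named intermediates. This is an illustrative decomposition, not the original author's code: the variable names are made up here, NumPy's `argsort` is called with `kind='stable'` to make the ordering explicit, and the mask is re-sorted by index before use so that it lines up with the frame.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    'Var1': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'E', 'F'],
    'Var2': ['A', 'Z', 'B', 'Y', 'X', 'C', 'Q', 'N', 'P'],
    'Value': [34, 45, 23, 54, 65, 77, 100, 102, 44],
})

# Step 1: True where Var1 == Var2; sorting booleans puts False before True.
match = d['Var1'] == d['Var2']
order = np.argsort(match.to_numpy(), kind='stable')

# Var1 reordered so the matching rows come last (original index labels kept).
reordered = d['Var1'].iloc[order]

# Step 2: mark every occurrence except the last one as a duplicate, then negate.
keep = ~reordered.duplicated(keep='last')

# Restore the original row order of the mask and filter.
result = d.loc[keep.sort_index()]
print(result)
```

Because the matching rows sit at the end of `reordered`, `keep='last'` selects them whenever a Var1 value occurs more than once.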

My suggestion would be to store the mapping from Var2 to Value as a dictionary:

    d = {}  # a plain dict here, not a DataFrame
    d['Var1']=['A','A','B','B','C','C','D','E','F']
    d['Var2']=['A','Z','B','Y','X','C','Q','N','P']
    # map each Var2 entry to its Value
    d['Var2Val'] = {'A':34,'Z':45,'B':23,'Y':54,'X':65,'C':77,'Q':100,'N':102,'P':44}

Then I would build a list for Var1 without duplicates, keeping the entries whose value also appears in Var2:

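    counts = {k: d['Var1'].count(k) for k in d['Var1']}  # rows per Var1 value
    # keep a row if Var1 == Var2, or if its Var1 value occurs only once
    rows = [(a, b, d['Var2Val'][b]) for a, b in zip(d['Var1'], d['Var2'])
            if a == b or counts[a] == 1]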

Then print the table.

At least this would be the simplest way, even though it might be a little long.
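For comparison, the same idea can be sketched directly in pandas with a boolean mask: keep a row when Var1 equals Var2, or when its Var1 value occurs only once. This is an illustration rather than part of the answer above, and it assumes that every duplicated Var1 value has exactly one row with Var1 == Var2, which holds for the sample data. The names `is_match` and `is_unique` are introduced here.

```python
import pandas as pd

d = pd.DataFrame({
    'Var1': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'E', 'F'],
    'Var2': ['A', 'Z', 'B', 'Y', 'X', 'C', 'Q', 'N', 'P'],
    'Value': [34, 45, 23, 54, 65, 77, 100, 102, 44],
})

# Rows where the two columns agree.
is_match = d['Var1'] == d['Var2']

# Rows whose Var1 value appears exactly once (keep=False flags all repeats).
is_unique = ~d['Var1'].duplicated(keep=False)

# Assumption: a duplicated Var1 value with no matching row would be dropped
# entirely by this mask; the sample data has no such case.
result = d[is_match | is_unique].set_index('Var1')
print(result)
```

If a duplicated Var1 value could lack a matching row, the sort-then-deduplicate approach is the safer choice because it always keeps one row per value.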


 