Pandas remove duplicates with a criterion
Say I have the following dataframe:
>>> import pandas as pd
>>>
>>> d=pd.DataFrame()
>>>
>>> d['Var1']=['A','A','B','B','C','C','D','E','F']
>>> d['Var2']=['A','Z','B','Y','X','C','Q','N','P']
>>> d['Value']=[34, 45, 23, 54, 65, 77,100,102,44]
>>> d
  Var1 Var2  Value
0    A    A     34
1    A    Z     45
2    B    B     23
3    B    Y     54
4    C    X     65
5    C    C     77
6    D    Q    100
7    E    N    102
8    F    P     44
>>>
I want to drop rows that have duplicate values in "Var1", but I want to make sure that the duplicate that is kept is the one where 'Var1'=='Var2'.
My output dataframe would be:
     Var2  Value
Var1
A       A     34
B       B     23
C       C     77
D       Q    100
E       N    102
F       P     44
>>>
Any suggestions as to how I can do this? Would using groupby filter be the best approach?
Here's a one-liner:
>>> d.loc[~d.Var1[(d.Var1 == d.Var2).argsort()].duplicated('last')]
  Var1 Var2  Value
0    A    A     34
2    B    B     23
5    C    C     77
6    D    Q    100
7    E    N    102
8    F    P     44
You can then set the index on Var1 if you want (d.set_index('Var1')) to get exactly the output you posted.
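If the one-liner is hard to read, roughly the same result can be reached with sort_values and drop_duplicates before calling set_index. This is only a sketch of an alternative, not the method above: the helper column name is_match is made up for illustration, and kind="mergesort" is chosen on the assumption that a stable sort is wanted so that ties keep their original order.

```python
import pandas as pd

d = pd.DataFrame({
    "Var1": ["A", "A", "B", "B", "C", "C", "D", "E", "F"],
    "Var2": ["A", "Z", "B", "Y", "X", "C", "Q", "N", "P"],
    "Value": [34, 45, 23, 54, 65, 77, 100, 102, 44],
})

result = (
    d.assign(is_match=d["Var1"] == d["Var2"])   # hypothetical helper column
     .sort_values("is_match", kind="mergesort")  # False first, True last (stable)
     .drop_duplicates("Var1", keep="last")       # the last row per Var1 wins
     .drop(columns="is_match")
     .sort_index()                               # restore the original row order
     .set_index("Var1")
)
print(result)
```

For a Var1 value with no row where Var1 == Var2, this keeps the last of its rows in the original order.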
To break the one-liner down:
d.Var1[(d.Var1 == d.Var2).argsort()] is a Series with the values of Var1 arranged so that the rows where Var1 == Var2 are at the end.
~d.Var1[(d.Var1 == d.Var2).argsort()].duplicated('last') is True for rows where Var1 is not duplicated; if there are duplicates we pick the last one (so Var1 == Var2 has priority).
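The same steps can be written out with named intermediates, which makes each stage easy to inspect. This is a sketch rather than the exact code above: the names match, order, reordered and keep are illustrative, and it assumes a stable sort plus positional .iloc indexing so that the result does not depend on the index labels.

```python
import pandas as pd

d = pd.DataFrame({
    "Var1": ["A", "A", "B", "B", "C", "C", "D", "E", "F"],
    "Var2": ["A", "Z", "B", "Y", "X", "C", "Q", "N", "P"],
    "Value": [34, 45, 23, 54, 65, 77, 100, 102, 44],
})

# Step 1: True where Var1 == Var2.
match = d["Var1"] == d["Var2"]

# Step 2: positions that sort the mask. False sorts before True,
# so the matching rows end up last.
order = match.to_numpy().argsort(kind="stable")

# Step 3: Var1 in that order (the index labels travel with the values).
reordered = d["Var1"].iloc[order]

# Step 4: True for the last occurrence of each Var1 value.
keep = ~reordered.duplicated(keep="last")

# Step 5: put the mask back in row order and filter.
result = d.loc[keep.sort_index()]
print(result)
```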
My suggestion would be to store Var2 and the Value as a dictionary.
d = {}  # a plain dict here, not the DataFrame from the question
d['Var1']=['A','A','B','B','C','C','D','E','F']
d['Var2']=['A','Z','B','Y','X','C','Q','N','P']
d['Var2Val'] = {'A':34,'Z':45,'B':23,'Y':54,'X':65,'C':77,'Q':100,'N':102,'P':44}
Then I would create a list for Var1 without duplicates, and print those if they are in Var2:
for x in dict.fromkeys(d['Var1']):  # unique Var1 values, in original order
    if x in d['Var2']:              # x also occurs in Var2
        print(x, d['Var2Val'][x])   # then print the table row
At least this would be the simplest way, even though it might be a little long.
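For what it's worth, the dictionary idea can be carried all the way to the requested output without pandas. The sketch below is one possible reading of it, not the code from the answer: the names var1, var2, values and chosen are made up, and it assumes that a Var1 value with no row where Var1 == Var2 should keep its first row.

```python
var1 = ["A", "A", "B", "B", "C", "C", "D", "E", "F"]
var2 = ["A", "Z", "B", "Y", "X", "C", "Q", "N", "P"]
values = [34, 45, 23, 54, 65, 77, 100, 102, 44]

chosen = {}  # maps each Var1 value to the (Var2, Value) pair that is kept
for v1, v2, val in zip(var1, var2, values):
    # Keep the first row seen for each Var1, but let a row where
    # Var1 == Var2 override whatever was stored before.
    if v1 not in chosen or v1 == v2:
        chosen[v1] = (v2, val)

for v1, (v2, val) in chosen.items():
    print(v1, v2, val)
```

Dictionaries preserve insertion order, so the rows come out in the order in which each Var1 value first appears.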