Pandas: drop duplicates using a condition

Question

Say I have the following dataframe:

>>> import pandas as pd
>>> 
>>> d=pd.DataFrame()
>>> 
>>> d['Var1']=['A','A','B','B','C','C','D','E','F']
>>> d['Var2']=['A','Z','B','Y','X','C','Q','N','P']
>>> d['Value']=[34, 45, 23, 54, 65, 77,100,102,44]
>>> d
  Var1 Var2  Value
0    A    A     34
1    A    Z     45
2    B    B     23
3    B    Y     54
4    C    X     65
5    C    C     77
6    D    Q    100
7    E    N    102
8    F    P     44

I want to drop rows that have duplicate values in "Var1", but I want to make sure that the row kept for each duplicated value is the one where 'Var1' == 'Var2'.

My desired output dataframe would be:

     Var2  Value
Var1            
A       A     34
B       B     23
C       C     77
D       Q    100
E       N    102
F       P     44

Any suggestions as to how I can do this? Would a groupby filter be the best approach?

Answer 1

Here's a one-liner:

>>> d.loc[~d.Var1[(d.Var1 == d.Var2).argsort()].duplicated('last')]

  Var1 Var2  Value
0    A    A     34
2    B    B     23
5    C    C     77
6    D    Q    100
7    E    N    102
8    F    P     44

You can then set the index to Var1 if you want (chain .set_index('Var1') onto the result) to get exactly the output you posted.

To break it down:

d.Var1[(d.Var1 == d.Var2).argsort()] is a Series holding the values of Var1, reordered so that the rows where Var1 == Var2 come last.
~d.Var1[(d.Var1 == d.Var2).argsort()].duplicated('last') is True for every row that should be kept: rows whose Var1 value is unique and, where a Var1 value is duplicated, its last occurrence in that ordering (so the row with Var1 == Var2 has priority).
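If the one-liner feels dense, the same idea can be spelled out step by step. This is only a minimal sketch, not the code from the answer above: it swaps argsort for sort_values followed by drop_duplicates(keep='last'), and the helper column name _match is hypothetical, introduced purely for illustration. It assumes a reasonably recent pandas where DataFrame.drop accepts columns=.

```python
import pandas as pd

# Same data as in the question.
d = pd.DataFrame({
    "Var1": ["A", "A", "B", "B", "C", "C", "D", "E", "F"],
    "Var2": ["A", "Z", "B", "Y", "X", "C", "Q", "N", "P"],
    "Value": [34, 45, 23, 54, 65, 77, 100, 102, 44],
})

result = (
    # "_match" is a hypothetical helper column: True where Var1 == Var2.
    d.assign(_match=d["Var1"] == d["Var2"])
     # False sorts before True, so matching rows end up last;
     # mergesort is stable, so ties keep their original order.
     .sort_values("_match", kind="mergesort")
     # Within each Var1 value keep the last row, i.e. the matching one if it exists.
     .drop_duplicates("Var1", keep="last")
     # Restore the original row order and drop the helper column.
     .sort_index()
     .drop(columns="_match")
)

print(result)
```

When a Var1 value is duplicated but none of its rows has Var1 == Var2, this sketch keeps the last of those rows in the original order; pick a different tie-break if that matters for your data.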

Answer 2

My suggestion would be to keep the Var2-to-Value mapping as a dictionary.

    d['Var1']=['A','A','B','B','C','C','D','E','F']
    d['Var2']=['A','Z','B','Y','X','C','Q','N','P']
    # plain dict keyed by Var2 (this only works because the Var2 values are unique)
    var2_val = {'A':34,'Z':45,'B':23,'Y':54,'X':65,'C':77,'Q':100,'N':102,'P':44}

Then I would go through the Var1 values without duplicates and, for each one, pick the row whose Var2 equals it if there is one, printing the table as I go:

    for x in d['Var1'].unique():
        candidates = d.loc[d['Var1'] == x, 'Var2'].tolist()  # Var2 values paired with this Var1
        y = x if x in candidates else candidates[0]          # prefer the Var1 == Var2 row
        print(x, y, var2_val[y])

At least this would be the simplest way, even though it might be a little long.


Answer 1
2 · Accepted · 2016-05-23 19:35:06

Answer 2
0 · 2016-05-23 19:40:52
