Select rows from a Pandas DataFrame with same values in one column but different values in the other column
Say I have the pandas DataFrame below:
     A      B  C   D
1  foo    one  0   0
2  foo    one  2   4
3  foo    two  4   8
4  cat    one  8   4
5  bar   four  6  12
6  bar  three  7  14
7  bar   four  7  14
I would like to select all the rows that have equal values in A but differing values in B. So I would like the output of my code to be:
     A      B  C   D
1  foo    one  0   0
3  foo    two  4   8
5  bar  three  7  14
6  bar   four  7  14
What's the most efficient way to do this? I have approximately 11,000 rows with a lot of variation in the column values, but this situation comes up a lot. In my dataset, if elements in column A are equal then the corresponding column B values should also be equal; however, due to mislabeling this is not the case and I would like to fix it, since doing so one by one would be impractical.
You can try groupby() + filter + drop_duplicates():
>>> df.groupby('A').filter(lambda g: len(g) > 1).drop_duplicates(subset=['A', 'B'], keep="first")
     A      B  C   D
0  foo    one  0   0
2  foo    two  4   8
4  bar   four  6  12
5  bar  three  7  14
Or, in case you only want to drop duplicates over the subset of columns A and B, you can use the line below, but the result will then also keep the cat row:
>>> df.drop_duplicates(subset=['A', 'B'], keep="first")
     A      B  C   D
0  foo    one  0   0
2  foo    two  4   8
3  cat    one  8   4
4  bar   four  6  12
5  bar  three  7  14
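If you want both steps in a single pass, i.e. de-duplicate on (A, B) and also drop rows whose A value occurs only once, one option (a sketch, not part of the original answer) is to combine drop_duplicates with a duplicated(keep=False) mask on A:

```python
import pandas as pd

# Example frame from the question (0-based index, as in the answer above).
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
    'C': [0, 2, 4, 8, 6, 7, 7],
    'D': [0, 4, 8, 4, 12, 14, 14],
})

# Keep the first row of every (A, B) pair, then drop rows whose A
# value appears only once among the survivors (here, the lone 'cat' row).
deduped = df.drop_duplicates(subset=['A', 'B'], keep='first')
result = deduped[deduped['A'].duplicated(keep=False)]
print(result)
```

This produces the same four rows as the groupby/filter version, without the Python-level lambda.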
Use groupby + filter + head:
result = df.groupby('A').filter(lambda g: len(g) > 1).groupby(['A', 'B']).head(1)
print(result)
Output
     A      B  C   D
0  foo    one  0   0
2  foo    two  4   8
4  bar   four  6  12
5  bar  three  7  14
The first group-by and filter will remove the rows with no duplicated A values (i.e. cat); the second will create groups with the same A, B and take the first element of each.
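As a variation on the above (a sketch, not from either answer), the question's actual criterion, "A groups whose B values are not all identical," can also be expressed as a vectorized mask with transform('nunique'), which avoids the Python-level filter callback and tends to scale better on larger frames:

```python
import pandas as pd

# Example frame from the question (0-based index).
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
    'C': [0, 2, 4, 8, 6, 7, 7],
    'D': [0, 4, 8, 4, 12, 14, 14],
})

# True for rows whose A group contains more than one distinct B value.
mask = df.groupby('A')['B'].transform('nunique') > 1
result = df[mask].drop_duplicates(subset=['A', 'B'])
print(result)
```

Unlike filtering on group length alone, this mask also excludes A groups that are duplicated but agree on B, which matches the "differing values in B" requirement directly.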
The current answers are correct and may be more sophisticated too. If you have complex criteria, the filter function will be very useful. If you are like me and want to keep things simple, I feel the following is a more beginner-friendly way:
>>> import pandas as pd
>>> df = pd.DataFrame({
'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
'C': [0,2,4,8,6,7,7],
'D': [0,4,8,4,12,14,14]
}, index=[1,2,3,4,5,6,7])
>>> df = df.drop_duplicates(['A', 'B'], keep='last')
>>> df
     A      B  C   D
2  foo    one  2   4
3  foo    two  4   8
4  cat    one  8   4
6  bar  three  7  14
7  bar   four  7  14
>>> df = df[df.duplicated(['A'], keep=False)]
>>> df
     A      B  C   D
2  foo    one  2   4
3  foo    two  4   8
6  bar  three  7  14
7  bar   four  7  14
keep='last' is optional here.
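Since the question says the differing B values are mislabels that should ultimately be repaired, here is a hedged sketch of one bulk fix: overwrite B with the most common B value within each A group. This assumes the majority label in each group is the correct one, which the question does not guarantee.

```python
import pandas as pd

# Example frame from the question (0-based index).
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
    'C': [0, 2, 4, 8, 6, 7, 7],
    'D': [0, 4, 8, 4, 12, 14, 14],
})

# Replace every B with the most frequent B in its A group.
# mode() returns sorted values, so ties resolve alphabetically.
df['B'] = df.groupby('A')['B'].transform(lambda s: s.mode().iloc[0])
print(df)
```

After this, every A group has exactly one B value, so the selection in the question would return no rows.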