How do I find out which rows are duplicates in a Pandas DataFrame?

Question

I am working with Pandas and the function duplicated() to detect which rows are equal:

import pandas as pd

d = {
    1: {'name': 'n1', 1: 10, 2: 20, 3: 30},
    2: {'name': 'n2', 1: 10, 2: 20, 3: 30},
    3: {'name': 'n3', 1: 11, 2: 21, 3: 30},
    4: {'name': 'n4', 1: 11, 2: 21, 3: 30},
    5: {'name': 'n5', 1: 12, 2: 22, 3: 30},
    6: {'name': 'n6', 1: 13, 2: 22, 3: 30},
    7: {'name': 'n7', 1: 14,        3: 35},
    8: {'name': 'n8',        2: 22, 3: 35},
}
df = pd.DataFrame.from_dict(d).transpose().set_index('name')

This gives me a nice data frame like this one:

          1    2   3
name              
n1       10   20  30    # same as n2
n2       10   20  30    # same as n1
n3       11   21  30    # same as n4
n4       11   21  30    # same as n3
n5       12   22  30
n6       13   22  30
n7       14  NaN  35
n8      NaN   22  35

Now I want to group the rows whose column values are identical. That is, I want Pandas to tell me that rows n1 and n2 are equal, and so are n3 and n4.

Using duplicated() I get some interesting results:

df[df.duplicated(keep=False)]
         1   2   3
name            
n1      10  20  30
n2      10  20  30
n3      11  21  30
n4      11  21  30

This is correct, since it shows the rows that have duplicates. However, my aim is to know which rows duplicate one another, that is, the tuples of duplicates. I need a result of the form [(n1, n2), (n3, n4)], a list that pairs each row with its duplicates. A list, a dict, anything works for me as long as it carries this information.

I have been browsing through Pandas' documentation and cannot find anything like this. I tried a bit with groupby(), but nothing reasonable came up.

Answer 1

You can group by all columns and convert the index of each group to a list, then convert the resulting Series to a list:

df1 = df[df.duplicated(keep=False)]

# Group by every column and collect the index labels of each group
groups = df1.groupby(df1.columns.tolist()).apply(lambda x: x.index.tolist()).values.tolist()
print(groups)
[['n1', 'n2'], ['n3', 'n4']]

Detail:

print (df1.groupby(df1.columns.tolist()).apply(lambda x: x.index.tolist()))
1   2   3 
10  20  30    [n1, n2]
11  21  30    [n3, n4]
dtype: object
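For convenience, the steps above can be put together into one self-contained script. This is only a sketch: the helper name duplicate_groups is hypothetical (it is not part of Pandas), and it iterates over the groups instead of calling apply(), which should give the same lists for this data.

```python
import pandas as pd

# Data from the question
d = {
    1: {'name': 'n1', 1: 10, 2: 20, 3: 30},
    2: {'name': 'n2', 1: 10, 2: 20, 3: 30},
    3: {'name': 'n3', 1: 11, 2: 21, 3: 30},
    4: {'name': 'n4', 1: 11, 2: 21, 3: 30},
    5: {'name': 'n5', 1: 12, 2: 22, 3: 30},
    6: {'name': 'n6', 1: 13, 2: 22, 3: 30},
    7: {'name': 'n7', 1: 14,        3: 35},
    8: {'name': 'n8',        2: 22, 3: 35},
}
df = pd.DataFrame.from_dict(d).transpose().set_index('name')


def duplicate_groups(frame):
    """Return one list of index labels per set of identical rows.

    Hypothetical helper written for this example.
    """
    # Keep only rows that have at least one duplicate
    dupes = frame[frame.duplicated(keep=False)]
    # Group on every column; each group holds rows with identical values
    grouped = dupes.groupby(dupes.columns.tolist())
    return [group.index.tolist() for _, group in grouped]


result = duplicate_groups(df)
print(result)  # [['n1', 'n2'], ['n3', 'n4']]
```

Note that groupby() drops rows whose key contains NaN by default. That does not matter here, because none of the duplicated rows have missing values.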
