How can I know which are the duplicated rows in a Pandas Data Frame?

I am working with Pandas and the function duplicated() to detect which rows are equal:

import pandas as pd

d = {
    1: {'name': 'n1', 1: 10, 2: 20, 3: 30},
    2: {'name': 'n2', 1: 10, 2: 20, 3: 30},
    3: {'name': 'n3', 1: 11, 2: 21, 3: 30},
    4: {'name': 'n4', 1: 11, 2: 21, 3: 30},
    5: {'name': 'n5', 1: 12, 2: 22, 3: 30},
    6: {'name': 'n6', 1: 13, 2: 22, 3: 30},
    7: {'name': 'n7', 1: 14,        3: 35},
    8: {'name': 'n8',        2: 22, 3: 35},
}
# Assign the result so that `df` exists for the calls further down
df = pd.DataFrame.from_dict(d).transpose().set_index('name')

This gives me a nice data frame like this one:

          1    2   3
name              
n1       10   20  30    # same as n2
n2       10   20  30    # same as n1
n3       11   21  30    # same as n4
n4       11   21  30    # same as n3
n5       12   22  30
n6       13   22  30
n7       14  NaN  35
n8      NaN   22  35

Now I want to group those rows whose columns are the same. That is, I want Pandas to tell me that the rows n1 and n2 are equal, and so are n3 and n4.

Using duplicated() I get some interesting results:

df[df.duplicated(keep=False)]
         1   2   3
name            
n1      10  20  30
n2      10  20  30
n3      11  21  30
n4      11  21  30
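
For context, here is a minimal sketch of how the keep argument of duplicated() changes which rows get flagged. The small frame below only mirrors the first five rows of the data above and is built directly for illustration; the variable names first, last and both are made up for this example.

```python
import pandas as pd

# Illustrative frame mirroring rows n1..n5 of the question's data
df = pd.DataFrame(
    {1: [10, 10, 11, 11, 12], 2: [20, 20, 21, 21, 22], 3: [30, 30, 30, 30, 30]},
    index=pd.Index(['n1', 'n2', 'n3', 'n4', 'n5'], name='name'),
)

first = df.duplicated()             # default keep='first': flags only the later repeats
last = df.duplicated(keep='last')   # flags only the earlier repeats
both = df.duplicated(keep=False)    # flags every row that has at least one twin

print(first.tolist())  # [False, True, False, True, False]
print(last.tolist())   # [True, False, True, False, False]
print(both.tolist())   # [True, True, True, True, False]
```

With keep=False the mask selects all members of each duplicate set, which is why n1 through n4 appear in the output above, but the mask alone does not say which rows belong together.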

Which is correct, since it shows those rows that have duplicates. However, my aim is to find out which rows those are, and which of them are duplicates of each other. That is, I would need a result in the form of [(n1, n2), (n3, n4)], a list that groups the duplicates together. A list, a dict, anything works for me as long as it has this info.

I have been browsing through Pandas' documentation and cannot find anything like this. I tried a bit with groupby(), but nothing reasonable came up.

You can use groupby on all columns and convert the index of each group to a list, then convert the resulting Series to a list:

df1 = df[df.duplicated(keep=False)]

# Store the result under a new name so that df1 stays a DataFrame
dups = df1.groupby(df1.columns.tolist()).apply(lambda x: x.index.tolist()).values.tolist()
print(dups)
[['n1', 'n2'], ['n3', 'n4']]

Detail:

print (df1.groupby(df1.columns.tolist()).apply(lambda x: x.index.tolist()))
1   2   3 
10  20  30    [n1, n2]
11  21  30    [n3, n4]
dtype: object
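
If you want tuples as in the question, a variation on the same idea is to iterate over the groups instead of using apply. This is only a sketch: the helper name duplicate_groups is made up for illustration, the frame is rebuilt from plain lists rather than from the nested dict, and dropna=False assumes pandas 1.1 or later.

```python
import pandas as pd

nan = float('nan')

# The question's data, rebuilt column by column for a self-contained example
df = pd.DataFrame(
    {
        1: [10, 10, 11, 11, 12, 13, 14, nan],
        2: [20, 20, 21, 21, 22, 22, nan, 22],
        3: [30, 30, 30, 30, 30, 30, 35, 35],
    },
    index=pd.Index([f'n{i}' for i in range(1, 9)], name='name'),
)


def duplicate_groups(frame):
    """Return one tuple of index labels per set of identical rows (hypothetical helper)."""
    # Keep only the rows that have at least one identical twin
    dup = frame[frame.duplicated(keep=False)]
    # Group by every column; dropna=False keeps groups whose key contains NaN
    grouped = dup.groupby(list(dup.columns), dropna=False)
    return [tuple(group.index) for _, group in grouped]


print(duplicate_groups(df))  # [('n1', 'n2'), ('n3', 'n4')]
```

The dropna=False argument matters only when the duplicated rows themselves contain NaN: duplicated() treats NaN values as equal, while groupby drops NaN keys by default.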
