How can I know which are the duplicated rows in a Pandas DataFrame?
I am working with Pandas and the function duplicated() to detect which rows are equal:
import pandas as pd
d = {
1: {'name': 'n1', 1: 10, 2: 20, 3: 30},
2: {'name': 'n2', 1: 10, 2: 20, 3: 30},
3: {'name': 'n3', 1: 11, 2: 21, 3: 30},
4: {'name': 'n4', 1: 11, 2: 21, 3: 30},
5: {'name': 'n5', 1: 12, 2: 22, 3: 30},
6: {'name': 'n6', 1: 13, 2: 22, 3: 30},
7: {'name': 'n7', 1: 14, 3: 35},
8: {'name': 'n8', 2: 22, 3: 35},
}
# Assign the result so the later snippets can refer to it as df
df = pd.DataFrame.from_dict(d).transpose().set_index('name')
This gives me a nice data frame like this one:
1 2 3
name
n1 10 20 30 # same as n2
n2 10 20 30 # same as n1
n3 11 21 30 # same as n4
n4 11 21 30 # same as n3
n5 12 22 30
n6 13 22 30
n7 14 NaN 35
n8 NaN 22 35
Now I want to group the rows whose column values are the same. That is, I want Pandas to tell me that rows n1 and n2 are equal, and so are n3 and n4.
Using duplicated() I get some interesting results:
df[df.duplicated(keep=False)]
1 2 3
name
n1 10 20 30
n2 10 20 30
n3 11 21 30
n4 11 21 30
This is correct, since it shows the rows that have duplicates. However, my aim is to find out which rows those are and, in particular, which rows are duplicates of each other. That is, I need a result of the form [(n1, n2), (n3, n4)], a list that groups together the rows that duplicate one another. A list, a dict, anything works for me as long as it has this info.
I have been browsing through Pandas' documentation and cannot find anything like this. I tried a bit with groupby(), but nothing reasonable came up.
You can use groupby on all columns and convert the index of each group to a list, then convert the resulting Series to a list:
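# Keep every row that has at least one duplicate
df1 = df[df.duplicated(keep=False)]
# Group by all columns and collect the index labels of each group
groups = df1.groupby(df1.columns.tolist()).apply(lambda x: x.index.tolist()).values.tolist()
print(groups)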
[['n1', 'n2'], ['n3', 'n4']]
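The same idea can be written as a self-contained script that returns tuples, which is the shape the question asks for. This is only a sketch: the helper name `duplicate_groups` is made up for illustration, and the frame is built with `from_dict(..., orient='index')` instead of `transpose()`, which keeps the numeric columns numeric rather than object dtype.

```python
import pandas as pd

# Same data as in the question.
d = {
    1: {'name': 'n1', 1: 10, 2: 20, 3: 30},
    2: {'name': 'n2', 1: 10, 2: 20, 3: 30},
    3: {'name': 'n3', 1: 11, 2: 21, 3: 30},
    4: {'name': 'n4', 1: 11, 2: 21, 3: 30},
    5: {'name': 'n5', 1: 12, 2: 22, 3: 30},
    6: {'name': 'n6', 1: 13, 2: 22, 3: 30},
    7: {'name': 'n7', 1: 14, 3: 35},
    8: {'name': 'n8', 2: 22, 3: 35},
}
# orient='index' turns each outer key into a row (no transpose needed).
df = pd.DataFrame.from_dict(d, orient='index').set_index('name')


def duplicate_groups(frame):
    """Hypothetical helper: one tuple of index labels per set of equal rows."""
    # Keep only rows that have at least one exact duplicate.
    dups = frame[frame.duplicated(keep=False)]
    # Group by every column; each group holds rows with identical values.
    return [tuple(g.index) for _, g in dups.groupby(dups.columns.tolist())]


groups = duplicate_groups(df)
print(groups)
```

Iterating over the groupby object avoids `apply` and works the same way if a dict keyed by the row values is preferred instead of a list.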
Detail:
print(df1.groupby(df1.columns.tolist()).apply(lambda x: x.index.tolist()))
1 2 3
10 20 30 [n1, n2]
11 21 30 [n3, n4]
dtype: object
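One caveat that the question's data does not trigger, but frames with NaN can: duplicated() treats NaN values in the same position as equal, while groupby() drops rows whose key contains NaN by default. A minimal sketch, assuming pandas 1.1 or later (where groupby accepts `dropna=False`) and using a small made-up frame rather than the one above:

```python
import numpy as np
import pandas as pd

# Illustrative frame (not from the question): r1 and r2 are equal rows
# that both contain NaN, r3 and r4 are equal rows without NaN.
df_nan = pd.DataFrame(
    {"a": [1.0, 1.0, 2.0, 2.0, 3.0],
     "b": [np.nan, np.nan, 5.0, 5.0, 6.0]},
    index=["r1", "r2", "r3", "r4", "r5"],
)

dups = df_nan[df_nan.duplicated(keep=False)]
cols = dups.columns.tolist()

# Default behaviour: groups whose key contains NaN are silently dropped.
default_groups = [list(g.index) for _, g in dups.groupby(cols)]

# dropna=False keeps NaN keys, so the NaN duplicates form a group too.
nan_groups = [list(g.index) for _, g in dups.groupby(cols, dropna=False)]

print(default_groups)
print(nan_groups)
```

If duplicate rows with missing values matter, passing `dropna=False` to the groupby call in the answer is probably the smallest change.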