
Only Retain Unique Duplicates from Pandas Dataframe

EDIT: desired output for the example given:

first second third fourth fifth
1     2      3     4      5

EDIT 2: changed count() to size()

I've come across several instances when analyzing data where I'd like to return all duplicated rows, but only one row for each set of duplicates. I'm trying to do so within Pandas with Python 3.

Using groupby and size I can get the output I'm looking for, but it's not intuitive. The pandas "duplicated" function doesn't return the desired output, as it returns multiple rows if a row occurs more than twice.

    import pandas as pd

    data = [[1,2,3,4,5],
            [1,2,3,4,5],
            [1,2,3,4,5],
            [4,5,6,7,8]]

    # build the frame first, then name its columns
    x = pd.DataFrame(data)
    x.columns = ['first','second','third','fourth','fifth']

    x.groupby(list(x.columns)).size() > 1

The groupby approach identifies the row I want (as a boolean Series keyed by the column values), while using

    x[x.duplicated(keep='first')]

will still return duplicate rows. Is there a more Pythonic way of only returning the unique duplicates?
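For reference, here is a minimal sketch of how the groupby result can be filtered down to a one-row frame. The variable names `counts` and `unique_dupes` and the `count` column are chosen for illustration only and are not part of the original question.

```python
import pandas as pd

data = [[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [4, 5, 6, 7, 8]]
x = pd.DataFrame(data, columns=['first', 'second', 'third', 'fourth', 'fifth'])

# Number of occurrences of each distinct row, indexed by the column values
counts = x.groupby(list(x.columns)).size()

# Keep only the combinations seen more than once and turn the index
# back into ordinary columns ('count' is an illustrative column name)
unique_dupes = counts[counts > 1].reset_index(name='count')
print(unique_dupes.drop(columns='count'))
```

This keeps the occurrence count available if it is useful, at the cost of losing the original row labels.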

Use

x.drop_duplicates()

   first  second  third  fourth  fifth
0      1       2      3       4      5
3      4       5      6       7      8

You can chain what you already select with duplicated and then apply drop_duplicates, such as:

print (x[x.duplicated()].drop_duplicates())
   first  second  third  fourth  fifth
1      1       2      3       4      5
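A possible variation, sketched here as an assumption rather than as part of the answer above: passing `keep=False` to `duplicated` marks every member of a duplicated group, so the surviving row carries the index label of the first occurrence instead of the second. The names `all_dupes` and `unique_dupes` are illustrative.

```python
import pandas as pd

data = [[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [4, 5, 6, 7, 8]]
x = pd.DataFrame(data, columns=['first', 'second', 'third', 'fourth', 'fifth'])

# keep=False marks every row that has at least one identical twin,
# including the first occurrence
all_dupes = x[x.duplicated(keep=False)]

# Collapsing those rows leaves one row per duplicated group,
# labelled with the index of its first occurrence
unique_dupes = all_dupes.drop_duplicates()
print(unique_dupes)
```

Either form should return one row per duplicated group; they differ only in which index label is kept.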

You can still use .duplicated() to check whether a row is a duplicate or not. It returns True for each row that repeats an earlier one.

After that, we create a flag and loop over the results to pick out only the first row of each run of duplicates. See the code below for details.

import pandas as pd

data = [[1,2,3,4,5],
        [1,2,3,4,5],
        [1,2,3,4,5],
        [4,5,6,7,8]]

x = pd.DataFrame(data)
x.columns = ['first','second','third','fourth','fifth']

lastFlag = False  # duplicate flag of the previous row
dupl = x.duplicated()  # boolean Series: True for each repeat of an earlier row
for i in range(len(dupl)):  # walk the flags in row order
    # print only the first row of each run of consecutive duplicates
    if lastFlag != dupl.iloc[i]:
        lastFlag = dupl.iloc[i]
        if dupl.iloc[i]:
            print(x.iloc[i, :])  # printed as a pandas.Series
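The same flag logic can probably be written without an explicit loop. The sketch below is an illustration, not the answer's own code: it compares each duplicate flag with the previous row's flag via `Series.shift`, and the names `prev` and `first_of_run` are made up for the example. Like the loop, it assumes identical rows sit next to each other; if copies of a row are separated by other rows, the row can be returned more than once.

```python
import pandas as pd

data = [[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [4, 5, 6, 7, 8]]
x = pd.DataFrame(data, columns=['first', 'second', 'third', 'fourth', 'fifth'])

dupl = x.duplicated()  # True for each repeat of an earlier row

# Flag of the previous row; the first row has no predecessor, so fill with False
prev = dupl.shift(1, fill_value=False).astype(bool)

# True only where a run of duplicate flags starts
first_of_run = dupl & ~prev
print(x[first_of_run])
```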

Hope this helps.
