Only Retain Unique Duplicates from Pandas Dataframe

EDIT: desired output for the example given:

first second third fourth fifth
1     2      3     4      5

EDIT 2: changed count() to size()

I've come across several instances when analyzing data where I'd like to return all duplicated rows, but only one row for each duplicate. I'm trying to do so within Pandas with Python 3.

Using groupby and size I can get the output I'm looking for, but it's not intuitive. The pandas "duplicated" method doesn't return the desired output on its own, because it flags every repeat: a row that appears three times yields two flagged rows.

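    import pandas as pd

    data = [[1,2,3,4,5],
            [1,2,3,4,5],
            [1,2,3,4,5],
            [4,5,6,7,8]]

    # Build the frame first, then name its columns
    x = pd.DataFrame(data)
    x.columns = ['first','second','third','fourth','fifth']

    # True for each distinct row that occurs more than once
    x.groupby(list(x.columns)).size() > 1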

The groupby expression identifies the row I want (it yields a boolean Series indexed by the distinct row values), while using

    x[x.duplicated(keep='first')]

will still return duplicate rows. Is there a more pythonic way of only returning the unique duplicates?
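For reference, here is a minimal sketch of how the groupby result could be turned back into a DataFrame with one row per repeated combination. The names `counts`, `repeated` and `unique_dupes` are illustrative only. One caveat: groupby drops rows whose key columns contain NaN by default, so this assumes the data has no missing values.

```python
import pandas as pd

x = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [1, 2, 3, 4, 5],
     [1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8]],
    columns=['first', 'second', 'third', 'fourth', 'fifth'],
)

# Count how many times each distinct row occurs
counts = x.groupby(list(x.columns)).size()

# Keep only the combinations that occur more than once
repeated = counts[counts > 1]

# The MultiIndex holds the row values; convert it back into columns
unique_dupes = repeated.index.to_frame(index=False)
print(unique_dupes)
```

The resulting frame has a fresh 0-based index, so the original row labels are not preserved with this approach.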

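Use drop_duplicates, which keeps the first occurrence of every distinct row. Note that the result also includes rows that were never duplicated (row 3 below):

    x.drop_duplicates()

       first  second  third  fourth  fifth
    0      1       2      3       4      5
    3      4       5      6       7      8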

You can chain drop_duplicates onto the rows you already select with duplicated:

    print(x[x.duplicated()].drop_duplicates())
       first  second  third  fourth  fifth
    1      1       2      3       4      5
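The same chaining idea can be wrapped in a small helper. The sketch below is one possible variant, not the answer's exact code: it passes `keep=False` so that every member of a repeated group is selected, which means the surviving row carries the index of the first occurrence (0) rather than the second (1). The function name `unique_duplicates` is hypothetical.

```python
import pandas as pd

def unique_duplicates(df, subset=None):
    """Return one row per group of repeated rows (hypothetical helper)."""
    # keep=False marks every member of a repeated group, including the first
    mask = df.duplicated(subset=subset, keep=False)
    # drop_duplicates then keeps the first occurrence within each group
    return df[mask].drop_duplicates(subset=subset)

x = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [1, 2, 3, 4, 5],
     [1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8]],
    columns=['first', 'second', 'third', 'fourth', 'fifth'],
)

result = unique_duplicates(x)
print(result)
```

Passing `subset` restricts the comparison to the given columns, which can help when only some columns define what counts as a duplicate.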

You can still use .duplicated() to check whether a row is a duplicate; it returns True for each row that repeats an earlier one.

After that, keep a flag and loop over the result to pick out only the first flagged row of each run of duplicates. See the code below for the details.

    import pandas as pd

    data = [[1,2,3,4,5],
            [1,2,3,4,5],
            [1,2,3,4,5],
            [4,5,6,7,8]]

    x = pd.DataFrame(data)
    x.columns = ['first','second','third','fourth','fifth']

    lastFlag = False        # duplicate status of the previous row
    dupl = x.duplicated()   # True for every row that repeats an earlier row
    for i in range(len(dupl)):
        # act only when the status differs from the previous row
        if lastFlag != dupl.iloc[i]:
            lastFlag = dupl.iloc[i]
            # a change to True marks the first flagged row of a run
            if dupl.iloc[i]:
                print(x.iloc[i, :])  # printed as a pandas Series

Hope this helps.
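The flag-and-loop logic above can probably be expressed without an explicit loop. The sketch below is an assumed equivalent, not the answer's own code: it compares the duplicated mask with a shifted copy of itself to find where a run of flagged rows begins. The names `run_starts` and `first_of_runs` are illustrative. Like the loop, it relies on repeated rows being adjacent; if copies of a row are scattered through the frame, the same row can be selected more than once, or a repeated row can be missed.

```python
import pandas as pd

x = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [1, 2, 3, 4, 5],
     [1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8]],
    columns=['first', 'second', 'third', 'fourth', 'fifth'],
)

# True for every row that repeats an earlier row
dupl = x.duplicated()

# True only where a flagged row follows an unflagged one,
# i.e. the first flagged row of each consecutive run
run_starts = dupl & ~dupl.shift(fill_value=False)

first_of_runs = x[run_starts]
print(first_of_runs)
```

This returns a DataFrame rather than printing each row as a Series, so the result can be reused in later steps.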
