简体   繁体   中英

How to only keep rows which have more than one value in a pandas DataFrame?

I often try to do the following operation, but there's an immediate solution which is most efficient in pandas:

I have the following example pandas DataFrame, whereby there are two columns, Name and Age :

import pandas as pd

data = [['Alex',10],['Bob',12],['Barbara',25], ['Bob',72], ['Clarke',13], ['Clarke',13], ['Destiny', 45]]

df = pd.DataFrame(data,columns=['Name','Age'], dtype=float)

print(df)
      Name   Age
0     Alex  10.0
1      Bob  12.0
2  Barbara  25.0
3      Bob  72.0
4   Clarke  13.0
5   Clarke  13.0
6  Destiny  45.0

I would like to remove all rows which do have a matching value in Name . In the example df , there are two Bob values and two Clarke values. The intended output would therefore be:

      Name   Age
0      Bob  12.0
1      Bob  72.0
2   Clarke  13.0
3   Clarke  13.0

whereby I'm assuming that there's a reset index.

One option would be to keep all unique values for Name in a list, and then iterate through the dataframe to check for duplicate rows. That would be very inefficient.

Is there a built-in function for this task?

Use drop_duplicates , and only get the ones that are dropped:

print(df[~df['Name'].isin(df['Name'].drop_duplicates(False))])

Output:

     Name   Age
1     Bob  12.0
3     Bob  72.0
4  Clarke  13.0
5  Clarke  13.0

If care about the index, do:

print(df[~df['Name'].isin(df['Name'].drop_duplicates(False))].reset_index(drop=1))

Output:

     Name   Age
0     Bob  12.0
1     Bob  72.0
2  Clarke  13.0
3  Clarke  13.0

Using duplicated

df[df.Name.duplicated(keep=False)]
     Name   Age
1     Bob  12.0
3     Bob  72.0
4  Clarke  13.0
5  Clarke  13.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM