I often need to do the following operation, but I'm not sure which solution is most efficient in pandas:
I have the following example pandas DataFrame with two columns, Name and Age:
import pandas as pd
data = [['Alex',10],['Bob',12],['Barbara',25], ['Bob',72], ['Clarke',13], ['Clarke',13], ['Destiny', 45]]
df = pd.DataFrame(data,columns=['Name','Age'], dtype=float)
print(df)
Name Age
0 Alex 10.0
1 Bob 12.0
2 Barbara 25.0
3 Bob 72.0
4 Clarke 13.0
5 Clarke 13.0
6 Destiny 45.0
I would like to keep only the rows whose value in Name appears more than once (equivalently, remove all rows whose Name is unique). In the example df, there are two Bob values and two Clarke values. The intended output would therefore be:
Name Age
0 Bob 12.0
1 Bob 72.0
2 Clarke 13.0
3 Clarke 13.0
where I'm assuming the index is reset.
One option would be to collect the unique values of Name in a list and then iterate through the DataFrame checking each row against it, but that would be very inefficient.
Is there a built-in function for this task?
Use drop_duplicates with keep=False, then select the rows it would have dropped:
print(df[~df['Name'].isin(df['Name'].drop_duplicates(keep=False))])
Output:
Name Age
1 Bob 12.0
3 Bob 72.0
4 Clarke 13.0
5 Clarke 13.0
If you care about the index, reset it:
print(df[~df['Name'].isin(df['Name'].drop_duplicates(keep=False))].reset_index(drop=True))
Output:
Name Age
0 Bob 12.0
1 Bob 72.0
2 Clarke 13.0
3 Clarke 13.0
Using duplicated with keep=False, which marks every occurrence of a repeated value:
df[df.Name.duplicated(keep=False)]
Name Age
1 Bob 12.0
3 Bob 72.0
4 Clarke 13.0
5 Clarke 13.0