
How to delete duplicate rows in pandas

I need to check whether one column of a DataFrame contains duplicate values using pandas and, if it does, delete the entire row for each duplicate. Only the first column should be checked.

Example:

object    type

apple     fruit
ball      toy
banana    fruit
xbox      videogame
banana    fruit
apple     fruit

What I need is:

object    type

apple     fruit
ball      toy
banana    fruit
xbox      videogame

I can delete the duplicates in the 'object' column with the following code, but I can't delete the entire row that contains the duplicate, because the second column is left untouched.


import pandas as pd

df = pd.read_csv(directory, header=None)

objects = df[0]

for obj in df[0]:
    ...  # the rest of the loop is missing from the original post

Build a boolean mask with duplicated() and negate it, so that only the first occurrence of each value is kept:

df = df[~df["object"].duplicated()]

Which gives:

   object       type
0   apple      fruit
1    ball        toy
2  banana      fruit
3    xbox  videogame
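
To make the mask itself visible, here is a minimal sketch using the sample data from the question. The names mask, deduped and unique_only are illustrative only, and the keep=False variant is an extra that the question does not ask for.

```python
import pandas as pd

df = pd.DataFrame({
    "object": ["apple", "ball", "banana", "xbox", "banana", "apple"],
    "type": ["fruit", "toy", "fruit", "videogame", "fruit", "fruit"],
})

# True marks every repeat of a value already seen higher up in the column.
mask = df["object"].duplicated()
print(mask.tolist())

# Negating the mask keeps the first occurrence of each value, whole row included.
deduped = df[~mask]
print(deduped)

# keep=False flags every member of a duplicated group, so only values that
# occur exactly once in the column survive.
unique_only = df[~df["object"].duplicated(keep=False)]
print(unique_only)
```

Because the mask is computed from a single column but used to index the whole DataFrame, the matching 'type' values are dropped together with the duplicated 'object' values.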

Use the drop_duplicates method:

import pandas as pd

d = pd.DataFrame(
    {'object': ['apple', 'ball', 'banana', 'xbox', 'banana', 'apple'],
     'type': ['fruit', 'toy', 'fruit', 'videogame', 'fruit', 'fruit']}
)
# With no arguments, rows are compared across all columns.
d.drop_duplicates()

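There are several keyword arguments that might come in handy: subset limits the comparison to certain columns, keep chooses which occurrence survives, and inplace=True updates the DataFrame d itself instead of returning a new one.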
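
As a rough illustration of those keyword arguments, the sketch below applies keep, ignore_index and inplace to the same sample data. The variable names last_kept, renumbered and result are made up for this example.

```python
import pandas as pd

d = pd.DataFrame(
    {'object': ['apple', 'ball', 'banana', 'xbox', 'banana', 'apple'],
     'type': ['fruit', 'toy', 'fruit', 'videogame', 'fruit', 'fruit']}
)

# keep="last" retains the final occurrence of each value instead of the first.
last_kept = d.drop_duplicates(subset='object', keep='last')
print(last_kept)

# ignore_index=True renumbers the surviving rows from 0.
renumbered = d.drop_duplicates(subset='object', keep='last', ignore_index=True)
print(renumbered)

# inplace=True modifies d directly and returns None.
result = d.drop_duplicates(subset='object', inplace=True)
print(result, len(d))
```

Assigning the returned DataFrame to a new name is usually easier to follow than inplace=True, since the original data stays available for comparison.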

You can use .drop_duplicates() with parameter subset='object' to select the column you want to check, as follows:

df_out = df.drop_duplicates(subset='object')

Result:

print(df_out)

   object       type
0   apple      fruit
1    ball        toy
2  banana      fruit
3    xbox  videogame
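
Since the question asks about "the first column" and reads the file with header=None, it may help to pick the column by position instead of by name. The sketch below assumes a comma-separated file with a header row; the csv_text contents stand in for the real file, which the question does not show.

```python
import io

import pandas as pd

# Assumed file contents, based on the example table in the question.
csv_text = """object,type
apple,fruit
ball,toy
banana,fruit
xbox,videogame
banana,fruit
apple,fruit
"""

# With a header row, the first column's label is taken from the file.
df = pd.read_csv(io.StringIO(csv_text))
first_col = df.columns[0]
df_out = df.drop_duplicates(subset=[first_col])
print(df_out)

# With header=None the columns are labelled 0, 1, ... and the header line
# is treated as an ordinary data row, so it stays in the result.
raw = pd.read_csv(io.StringIO(csv_text), header=None)
raw_out = raw.drop_duplicates(subset=[raw.columns[0]])
print(raw_out)
```

Selecting df.columns[0] works in both cases, so the same call can be used whether or not the file has a header row.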

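To count how many duplicate rows were dropped, subtract the length after dropping from the original length (len(df.drop_duplicates()) on its own gives the length after dropping). Store the result under a new name so that df is not overwritten:

n_dropped = len(df) - len(df.drop_duplicates())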
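
The counts involved can be kept in separate variables, which leaves the DataFrame itself intact. This is a sketch on the sample data from the question; n_before, n_after, n_dropped and n_flagged are illustrative names, and subset='object' is used so that only the first column is compared.

```python
import pandas as pd

df = pd.DataFrame({
    "object": ["apple", "ball", "banana", "xbox", "banana", "apple"],
    "type": ["fruit", "toy", "fruit", "videogame", "fruit", "fruit"],
})

n_before = len(df)
n_after = len(df.drop_duplicates(subset="object"))

# Number of rows removed because their 'object' value had already appeared.
n_dropped = n_before - n_after

# The same count taken directly from the boolean mask.
n_flagged = int(df["object"].duplicated().sum())

print(n_before, n_after, n_dropped, n_flagged)
```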


 