
Removing duplicates based on a condition in pandas

When removing duplicates, can I keep the rows that match a condition? Instead of doing:

df.drop_duplicates(subset=['x','y'], keep='first')

I would like to do something like this (not valid pandas, just the idea):

df.drop_duplicates(subset=['x','y'], keep=df.loc[df[column]=='String'])

Suppose I have a df like:

A  B

1  'Hi'
1  'Bye'

Keep the rows with 'Hi'. I want to do it this way because it would be more convenient, since I am going to introduce multiple conditions in the process.

Use DataFrame.duplicated, invert the resulting mask with ~, and combine it with the condition using & (element-wise AND):

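import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': ['Hi', 'Bye', 'Hi', 'Bye']})  # sample data
df['mask'] = ~df.duplicated(subset=['A','B']) & (df['B']=='Hi')
print(df)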
   A    B   mask
0  1   Hi   True
1  1  Bye  False
2  1   Hi  False
3  1  Bye  False
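
To actually drop rows, the mask can be used for boolean indexing instead of being stored as a column. The following is a minimal sketch on the same sample data; the variable names cond, mask and filtered are only illustrative, and the extra condition on A is an assumption added to show how several conditions could be chained:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 1], "B": ["Hi", "Bye", "Hi", "Bye"]})

# Conditions can be chained with & (AND) and | (OR); each one needs parentheses.
cond = (df["B"] == "Hi") & (df["A"] > 0)

# True only for the first occurrence of each (A, B) pair that also matches cond.
mask = ~df.duplicated(subset=["A", "B"]) & cond

# Boolean indexing keeps only the rows where mask is True.
filtered = df[mask]
print(filtered)
```

Note that this keeps only rows that satisfy the condition; groups in which no row matches are dropped entirely.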

This also works when the index contains duplicate values:

df.index = [0] * 4

df['mask'] = ~df.duplicated(subset=['A','B']) & (df['B']=='Hi')
print(df)
   A    B   mask
0  1   Hi   True
0  1  Bye  False
0  1   Hi  False
0  1  Bye  False
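
If the goal is instead to deduplicate on A alone while preferring the row where B is 'Hi' (as in the two-row example in the question), one possible alternative is to sort the preferred rows to the front and then call drop_duplicates with keep='first'. This is a sketch under that assumption, not the method from the answer above; the sample data and the helper column _rank are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": ["Bye", "Hi", "Bye", "Ciao"]})

# True where a row matches the condition(s) we want to keep.
preferred = df["B"].eq("Hi")

result = (
    df.assign(_rank=~preferred)                    # False (preferred) sorts first
      .sort_values("_rank", kind="stable")         # stable sort keeps ties in original order
      .drop_duplicates(subset=["A"], keep="first") # first row per A is now the preferred one
      .drop(columns="_rank")
      .sort_index()                                # restore the original row order
)
print(result)
```

Groups without any matching row fall back to their first row, so no value of A disappears from the result.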

