
How to Conditionally Remove Duplicates from Pandas DataFrame with a List

I have a df and want to remove all duplicates on ID.

      Name  Symbol         ID
0  ZOO INC  Remove  88579Y101
1  Zoo Inc     ZZZ  88579Y101
2    A Inc     AAA  90138A103
3   a inc.  Remove  90138A103
4   2U Inc    TWUO  90214J101
5     Keep  Remove  111111111

But I only want to remove the duplicate rows where Symbol == 'Remove'. The output should look like:

      Name  Symbol         ID
0  Zoo Inc     ZZZ  88579Y101
1    A Inc     AAA  90138A103
2   2U Inc    TWUO  90214J101
3     Keep  Remove  111111111

I can't use result_df = df.drop_duplicates(subset=['ID'], keep='first') (or keep='last') because the dataset doesn't follow a specific pattern, and sorting alphabetically first won't help either.
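To make that concrete with the sample frame above, keep='first' simply keeps whichever duplicate happens to appear first, which here is a 'Remove' row:

import pandas as pd

df = pd.DataFrame({
    'Name':   ['ZOO INC', 'Zoo Inc', 'A Inc', 'a inc.', '2U Inc', 'Keep'],
    'Symbol': ['Remove', 'ZZZ', 'AAA', 'Remove', 'TWUO', 'Remove'],
    'ID':     ['88579Y101', '88579Y101', '90138A103', '90138A103',
               '90214J101', '111111111'],
})

# keep='first' keeps row 0 ('ZOO INC' / 'Remove') and drops row 1
# ('Zoo Inc' / 'ZZZ') -- the opposite of what is wanted here.
print(df.drop_duplicates(subset=['ID'], keep='first'))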

And while I know I could replace all Remove values with NaN and then use the solution provided here, I am looking for an alternative solution because I may eventually need to pass a list of strings.
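For reference, a sketch of what that NaN-based workaround could look like; the helper column _key and the exact steps are my assumption, since the linked solution isn't reproduced here:

import numpy as np

# Mark 'Remove' rows as NaN in a helper column, sort so they fall
# last within each ID, keep the first occurrence, then restore order.
tmp = df.assign(_key=df['Symbol'].replace('Remove', np.nan))
result_df = (tmp.sort_values('_key', na_position='last')
                .drop_duplicates(subset=['ID'], keep='first')
                .sort_index()
                .drop(columns='_key'))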

Does Pandas support anything like result_df = df.drop_duplicates(subset=['ID'], keep=(df['Symbol'] != 'Remove'))?

Use Series.duplicated with keep=False to flag all duplicated IDs, compare Symbol with 'Remove', chain both masks together with & for bitwise AND, and invert the final mask with ~:

m1 = df['ID'].duplicated(keep=False)   # True for every row whose ID occurs more than once
m2 = (df['Symbol'] == 'Remove')        # True where Symbol is 'Remove'

df = df[~(m1 & m2)]                    # drop rows that are duplicated AND marked 'Remove'

print(df)
      Name  Symbol         ID
1  Zoo Inc     ZZZ  88579Y101
2    A Inc     AAA  90138A103
4   2U Inc    TWUO  90214J101
5     Keep  Remove  111111111
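
If you eventually need to match a list of strings instead of the single 'Remove', swap the equality compare for Series.isin; a sketch, where the list ['Remove', 'Delete'] is only an example:

to_remove = ['Remove', 'Delete']       # hypothetical list of markers

m1 = df['ID'].duplicated(keep=False)   # True for every duplicated ID
m2 = df['Symbol'].isin(to_remove)      # True where Symbol matches any marker

df = df[~(m1 & m2)]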
